Recently, a paper DeepMind published in a Nature sub-journal came under serious challenge. Researchers from Huawei's UK R&D Center attempted to reproduce DeepMind's approach and found that the computational power the paper requires is unattainable.

Editor's note: This article is from the WeChat public account "Machine Heart" (ID: almosthuman2014). Contributors: Yiming, Du Wei.

DeepMind has a strong academic reputation in reinforcement learning. From AlphaGo to AlphaStar, each of its projects has achieved world-renowned results. Just recently, however, a DeepMind paper on multi-agent reinforcement learning was publicly called out by Huawei's research center: the Huawei paper points out numerous problems with this DeepMind study.

The researchers argue that reproducing the paper DeepMind recently published in a Nature sub-journal would require up to one trillion dollars' worth of computing power, which exceeds all the computing power in the world.

So what exactly is DeepMind's research, and what problems does the Huawei paper identify?

  • Huawei UK R&D Center paper: https://arxiv.org/abs/1909.11628

  • DeepMind Paper: https://arxiv.org/pdf/1903.01373.pdf


Background: the DeepMind paper

As a new member of DeepMind's "Alpha" family, α-Rank was published this July in Scientific Reports, a Nature sub-journal. The researchers describe α-Rank as a new game-theoretic solution for dynamic games, validated on scenarios such as AlphaGo, AlphaZero, MuJoCo Soccer, and poker with good results.


In their paper, the Huawei researchers also propose a replacement for α-Rank, named α^α-Rank. The method uses a stochastic optimization strategy that greatly reduces the computational complexity.

How α-Rank works

α-Rank is a DeepMind study focused on multi-agent reinforcement learning scenarios. Reinforcement learning lets an agent explore a search space and receive rewards according to the strategies it chooses, so that it gradually converges to the best strategy. Unlike standard reinforcement learning, multi-agent reinforcement learning involves multiple agents, and their joint interaction with the environment creates situations far more complex than the single-agent case.

In a multi-agent system, each agent earns rewards by interacting with its environment, learns to improve its strategy, and seeks the best policy for acting in that environment. In single-agent reinforcement learning, the environment the agent faces is stationary. In multi-agent reinforcement learning, however, the environment is complex and dynamic, which inevitably makes learning much harder.

The simplest form of MARL is independent reinforcement learning (independent RL, InRL), in which each learner ignores the other agents and treats all interactions as part of its own ("local") environment. Beyond that, there is research in which multiple agents interact with the environment and with one another; agents must cooperate to form a joint strategy, and evaluating any one agent's chosen strategy requires evaluating the joint strategy.

Evaluation and learning of scalable multi-agent reinforcement learning strategies therefore face two major difficulties. First, the joint strategy space (the combination of all agents' policies) grows rapidly as the number of agents increases. Second, the multi-agent game is likely to fall into cyclic, rock-paper-scissors-style behavior, making the quality of a strategy hard to assess. To address the second problem, many multi-agent reinforcement learning studies turn to game-theoretic methods and evaluate agents by the fixed payoffs obtained from the final game outcomes.
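To see the cyclic difficulty concretely, here is a minimal sketch (our own illustration, not taken from either paper) of the rock-paper-scissors payoff matrix and its best-response cycle:

```python
import numpy as np

# Row player's payoffs: entry [i][j] is strategy i's payoff against strategy j
# (0 = rock, 1 = paper, 2 = scissors).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

names = ["rock", "paper", "scissors"]
for j, name in enumerate(names):
    # Best responses cycle (rock -> paper -> scissors -> rock), so no single
    # strategy can be ranked "best" from final payoffs alone.
    print(f"best response to {name}:", names[int(A[:, j].argmax())])
```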

Recently, DeepMind proposed a method called α-Rank for the task of multi-agent reinforcement learning evaluation. It is a graph-based, game-theoretic solution for evaluating agents jointly. α-Rank uses Markov-Conley chains to represent the dynamics of the game and attempts to compute a stationary distribution, from which a ranking of the joint strategies is derived.

Specifically, the DeepMind paper reduces multi-agent evaluation to computing the stationary distribution of a Markov chain. Given N agents, each with k strategies, the Markov chain is defined over the joint strategy space, with a transition matrix T ∈ R^(k^N × k^N). The stationary distribution ν ∈ R^(k^N) is obtained by solving Tν = ν, and the mass that ν assigns to each joint strategy is that strategy's ranking score. The highlight of this approach is that it ranks and evaluates joint strategies via this stationary distribution.
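As a concrete illustration of the Tν = ν computation, here is a minimal power-iteration sketch on a toy transition matrix; the matrix values are invented for the example, and this is not DeepMind's implementation:

```python
import numpy as np

def stationary_distribution(T, tol=1e-10, max_iter=10_000):
    """Power iteration for the stationary distribution nu of a
    column-stochastic transition matrix T, i.e. the fixed point T @ nu = nu."""
    nu = np.full(T.shape[0], 1.0 / T.shape[0])   # start from uniform
    for _ in range(max_iter):
        nu_next = T @ nu
        if np.abs(nu_next - nu).sum() < tol:
            return nu_next
        nu = nu_next
    return nu

# Toy example: 2 agents x 2 strategies -> 4 joint strategies, with an
# arbitrary column-stochastic matrix standing in for T.
T = np.array([[0.5, 0.2, 0.1, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.2, 0.2, 0.5, 0.2],
              [0.1, 0.1, 0.2, 0.6]])
print(stationary_distribution(T))  # mass per joint strategy = its ranking score
```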


Figure 1: Three agents. a) Each agent has 3 strategies (distinguished by color) and 5 copies; each agent population has a payoff value that measures its chosen strategy. b) A mutant strategy (red star) appears. c) Each population either keeps its original strategy or adopts the mutant strategy.

In α-Rank, the strategies of N agents are evaluated through mutation and selection. Initially, each agent population is built from multiple copies of a learner, and all agents in a population are assumed to execute the same fixed strategy. α-Rank then simulates the multi-agent game by randomly sampling learners from each population. At the end of each game, every participating agent receives a payoff, which drives strategy mutation and selection: the agent faces a probabilistic choice between switching to the mutant strategy, keeping its original strategy, or randomly picking a new strategy different from both. This process continues, the goal being for a dominant strategy to emerge and spread across all agent populations.
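The keep-or-switch step can be sketched with the standard logistic (Fermi) fixation probability from evolutionary dynamics, which is the style of selection rule α-Rank builds its transitions from; the parameter names and default values below are illustrative assumptions, not DeepMind's API:

```python
import numpy as np

def adoption_probability(f_mutant, f_incumbent, alpha=10.0, m=50):
    """Probability that a mutant strategy takes over a population of m
    copies, given the two payoffs and a selection intensity alpha
    (the classic Fermi fixation probability)."""
    df = f_mutant - f_incumbent
    if np.isclose(df, 0.0):
        return 1.0 / m                                  # neutral drift
    return (1 - np.exp(-alpha * df)) / (1 - np.exp(-alpha * m * df))

print(adoption_probability(1.0, 0.5))  # fitter mutant: probability near 1
print(adoption_probability(0.5, 1.0))  # weaker mutant: probability near 0
```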

The grounds for the rebuttal

Huawei's rebuttal rests mainly on the computational complexity of α-Rank. α-Rank claims to solve the problem in time polynomial in the number of agents. The Huawei paper argues that the actual complexity grows exponentially with the number of agents, and that the problem is in fact NP-hard.

The computational complexity of α-Rank is too high

The original α-Rank study claims the algorithm is tractable because it runs in time polynomial in the number of joint strategies. By that definition, if α-Rank had polynomial complexity, its running time would scale as O((N × k)^d) for some constant d, where N is the number of agents and k the number of strategies. But if the algorithm must compute a stationary distribution over a transition matrix with k^N rows and columns, the time complexity is O(k^N). That growth is exponential, not polynomial, and hence intractable. The Huawei researchers argue that computing the top-ranked joint strategy in α-Rank is NP-hard.
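A quick back-of-the-envelope computation (ours, not from either paper) shows how fast the joint strategy space outruns any polynomial:

```python
# Size of the alpha-Rank transition matrix for N agents, k = 5 strategies each.
k = 5
for N in (2, 5, 10, 15):
    states = k ** N         # k^N joint strategies
    entries = states ** 2   # dense k^N x k^N transition-matrix entries
    print(f"N={N:2d}: {states:.2e} joint strategies, {entries:.2e} matrix entries")
```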

One conclusion follows from this complexity analysis. If the stationary distribution is computed as α-Rank prescribes, to an ε-approximation with accuracy parameter ε > 0, various algorithms can be used; their complexities are compared in Table 1 below. Every existing method for computing this stationary distribution exhibits complexity that grows exponentially with the number of agents.


Table 1: Comparison of time and space complexity with N (number of agents) × k (number of strategies per agent) as input.

α-Rank's input is not clearly defined

Beyond the complexity issue, the Huawei paper questions what counts as α-Rank's input. DeepMind's paper reports complexity results for these agent evaluations and declares them tractable. The Huawei paper's point is that, without a formally defined input, such statements do not reflect the true underlying time complexity, so tractability cannot legitimately be claimed.

To illustrate, the Huawei paper cites the travelling salesman problem, in which a salesman must visit a series of cities and return to the starting city by the shortest route. The travelling salesman problem is famously NP-hard, yet by α-Rank's reasoning it could be recast at the scale of "meta-cities" and declared solvable in polynomial (indeed linear) time. That is not a valid claim.

The Huawei paper points out that even if the input were encoded so that the travelling salesman problem appeared solvable with polynomial complexity, that would not make any such algorithm tractable: an algorithm that runs in time polynomial in an exponentially large input (or that needs exponential space) has not really solved the problem. For complexity claims to be meaningful, the input must be defined properly.

A trillion dollars of compute would not suffice

With these questions unresolved, the Huawei paper can only assume that α-Rank's input is an exponentially large payoff matrix. On that basis, the researchers ran an experiment computing the cost of the scalability evaluation of just line 3 of Algorithm 1, also covering the tasks from DeepMind's companion paper "α-Rank: Multi-Agent Evaluation by Evolution".


The Huawei paper's cost calculation for the scalability evaluation of line 3 of α-Rank's Algorithm 1.

In addition, the total number of floating-point operations required to construct T is given by Equation 2 of the Huawei paper; it scales with the k^N × k^N entries of T and therefore grows exponentially with the number of agents.

For the construction of T in Equation 2, the cost computed in the Huawei paper (in US dollars) is shown in Figure 2, assuming an NVIDIA Tesla K80 GPU that rents for $0.90 per hour and delivers up to 5.6 TFlop/s at single precision.
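For intuition, the dollar arithmetic behind such a figure can be sketched as follows, using the quoted K80 price and throughput; the one-flop-per-entry assumption is ours and purely illustrative:

```python
FLOPS_PER_SEC = 5.6e12   # quoted K80 single-precision throughput (flop/s)
USD_PER_HOUR = 0.90      # quoted cloud rental price

def cost_usd(total_flops):
    """Dollar cost of a computation at the quoted price and throughput."""
    return total_flops / FLOPS_PER_SEC / 3600.0 * USD_PER_HOUR

# Illustrative: ~1 flop per entry of the dense k^N x k^N matrix T.
k = 5
for N in (10, 15, 20, 25):
    print(f"N={N}: ${cost_usd((k ** N) ** 2):.2e}")
```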


Figure 2: Cost of constructing the transition matrix T when computing α-Rank.

Note the red plane: it marks roughly $1 trillion, an estimate of the total computing power available worldwide. The projected contour shows that multi-agent evaluation with more than ten agents is impossible, because the computing power required by α-Rank's input grows exponentially.

Also note that the Huawei analysis ignores the cost of storing T and of computing the stationary distribution, so it is an optimistic estimate.

Moreover, if the payoff matrix is supplied as α-Rank's input and the AlphaZero experiment from the DeepMind paper is run, it would take more than 5,200 years even with all of the world's computing power.


Other algorithms fare no better: even with the payoff matrix supplied to α-Rank as input and DeepMind's experiments re-run, the costs and running times the Huawei researchers estimate for several well-known algorithms are all astronomical. Note: all of the world's computing power is assumed to be available here.

Huawei proposes an improved method: α^α-Rank

Huawei's paper adopts a stochastic optimization method that obtains a solution by randomly sampling the payoff matrix, without ever storing the exponentially large input. In contrast to the memory requirements in Table 1, the method needs only O(Nk) memory, and each iteration has linear complexity. Notably, most other methods must store an exponentially large matrix before performing any numerical operations at all. Although this alone does not reduce the theoretical time complexity, the Huawei paper extends the algorithm with a double-oracle heuristic, achieving a reduction of the joint strategy space. In practice, the paper's experiments show that α^α-Rank converges to the correct top-ranked strategy within a few hundred iterations on large strategy spaces.
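To convey the flavor of sampling instead of materializing the matrix, here is a hedged sketch that evaluates joint strategies with a simple payoff-driven random walk, querying a payoff oracle only for states actually visited; it mirrors the paper's motivation, not Huawei's actual stochastic-optimization algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_evaluation(N, k, payoff, steps=100_000, alpha=5.0):
    """Random walk over joint strategies: propose one agent's unilateral
    deviation, accept with a logistic probability in its payoff gain, and
    rank strategies by visit frequency. Memory stays proportional to the
    states visited, never the full k^N x k^N transition matrix."""
    state = tuple(int(s) for s in rng.integers(k, size=N))
    visits = {}
    for _ in range(steps):
        visits[state] = visits.get(state, 0) + 1
        i = int(rng.integers(N))                 # deviating agent
        proposal = list(state)
        proposal[i] = int(rng.integers(k))       # alternative strategy
        proposal = tuple(proposal)
        gain = payoff(proposal, i) - payoff(state, i)
        if rng.random() < 1.0 / (1.0 + np.exp(-alpha * gain)):
            state = proposal
    top = sorted(visits.items(), key=lambda kv: -kv[1])[:5]
    return [(s, c / steps) for s, c in top]

# Hypothetical payoff oracle: every agent prefers strategy 0.
print(sampled_evaluation(N=4, k=3, payoff=lambda s, i: -s[i]))
```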


Huawei’s proposed improvement.

Huawei's paper shows that α^α-Rank is scalable, successfully recovering optimal strategies in a driverless-car simulation and in the Ising model, settings with tens of millions of possible strategies. The authors note that current state-of-the-art methods fall far short of these scales: α-Rank's own experiments considered at most 4 agents with up to 4 strategies each. All experiments in the Huawei paper were run on a single machine with 64 GB of RAM and a 10-core Intel i9 CPU.


Figure 5: Large-scale multi-agent evaluation. (a) Convergence to the optimal joint strategy in the driverless simulation; (b) the equilibrium state of the Ising model.

The cover image is from Pexels. Original title: "Your algorithm couldn't run even with all the world's GPU computing power: DeepMind's Alpha series, once published in a Nature sub-journal, is called out by Huawei."