This article is from the WeChat public account: qubits (QbitAI), author: Q farming. Original title: "How did Dota2 champion OG get crushed by the AI? OpenAI's full paper, three years in the making, is finally released". Photo by Stackie Jia on Unsplash.

Team OG, the world champions of Dota2.

Against the artificial intelligence OpenAI Five, however, OG proved fragile. The five-human squad had earlier lost 0:2 beyond any doubt; across the two games combined, OG took down only two outer towers.

However, this is not the pinnacle of AI.

Now OpenAI has trained a brand-new AI called Rerun. Against the very OpenAI Five that crushed OG, Rerun's win rate reached ... well ... 98%.

After hearing the news, one Twitter user made their feelings known with a meme.

Relying mostly on self-play, an AI can dominate a game as complex as Dota2. How does it pull that off? Today, we finally have answers.

Yes, OpenAI has not only released Rerun, but also formally published a paper covering more than three years of research on its Dota2 project.

In the paper, OpenAI explains the system's principles, architecture, compute budget, parameters, and more. OpenAI's point: by increasing the batch size and total training time, they scaled computation far beyond previous work, showing that today's reinforcement learning techniques can reach superhuman level in a complex esports game.

These methods can be applied more broadly to zero-sum games in which two opponents compete continuously.

(Perhaps after reading it,) the OG team tweeted: "Wow! This paper looks great!"

At which point some netizens couldn't help but marvel: Wow, the OG team praising the paper that crushed them? Now we've seen everything ...

What does the paper say? We have summarized the key points.

Key point 1: Dota2 is more complex than Go

Esports games are more complicated than chess games.

The key to the problem was scaling the existing reinforcement learning setup to an unprecedented level, which took thousands of GPUs running for months. OpenAI built a distributed training system for the purpose.

One challenge in training is that the environment and the code are constantly changing. To avoid restarting from scratch after every change, OpenAI developed a set of tools that let training resume without loss of performance, a toolset they call "surgery".
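As a rough illustration of the idea (not OpenAI's actual tooling), one simple form of "surgery" is to enlarge a layer while copying the old weights in, so the grown model initially computes the same function and training can simply continue:

```python
import torch

def grow_linear(old: torch.nn.Linear, new_in: int, new_out: int) -> torch.nn.Linear:
    """Enlarge a linear layer while preserving its old behavior.
    A toy sketch of one 'surgery' idea; the paper's tools go far beyond this."""
    new = torch.nn.Linear(new_in, new_out)
    with torch.no_grad():
        new.weight.zero_()   # new connections start inert
        new.bias.zero_()
        out_dim, in_dim = old.weight.shape
        new.weight[:out_dim, :in_dim] = old.weight  # copy old weights into the corner
        new.bias[:out_dim] = old.bias
    return new
```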

A Dota2 match runs about 45 minutes, with the game rendering 30 frames per second; OpenAI Five acts once every 4 frames. A chess game lasts about 80 moves and a Go game about 150. By comparison, the AI must "play" roughly 20,000 moves in a single game of Dota2.
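The arithmetic behind that figure is straightforward:

```python
minutes = 45           # typical game length
fps = 30               # game frames per second
frames_per_action = 4  # OpenAI Five acts once every 4 frames

actions = minutes * 60 * fps // frames_per_action
print(actions)         # 20250 -- the "about 20,000 moves" per game
```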

Because of the fog of war, each side in Dota2 sees only a local view of the game; the rest of the state is hidden from it.

Compared with AlphaGo, the Dota2 system uses a batch size 50-150 times larger, a model 20 times larger, and a training time 25 times longer.

Key point 2: How the AI learns to play Dota2

Humans play Dota2 by making real-time decisions with a keyboard and mouse. As mentioned earlier, OpenAI Five acts once every 4 frames, an interval called a timestep. At each timestep, OpenAI Five receives observations such as hero health and positions.

Humans and OpenAI Five receive that same information in completely different ways.

When the AI system issues an action, the process can be pictured roughly like this.

Behind the AI is a neural network. The policy (π) is defined as a function mapping observations to a probability distribution over actions; it is a recurrent neural network (RNN) with 159 million parameters, built mainly around a single-layer, 4096-unit LSTM.

The structure is diagrammed in the paper; the LSTM alone contributes 84% of the model's parameters.
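For intuition, here is a minimal sketch of this kind of policy network. Sizes and the observation/action encodings are placeholders, not the paper's; only the overall shape (encode, one big LSTM, action head) mirrors the description above:

```python
import torch
import torch.nn as nn

class PolicySketch(nn.Module):
    """Observations in, a distribution over actions out, with a single
    large LSTM as the recurrent core."""
    def __init__(self, obs_dim=512, hidden=4096, n_actions=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)        # embed raw observations
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.action_head = nn.Linear(hidden, n_actions)  # logits over actions

    def forward(self, obs_seq, state=None):
        x = torch.relu(self.encoder(obs_seq))
        x, state = self.lstm(x, state)                   # memory across timesteps
        return torch.distributions.Categorical(logits=self.action_head(x)), state
```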

The agents are trained with an extended version of Proximal Policy Optimization (PPO), which is also OpenAI's default reinforcement learning algorithm today. The agents' goal is to maximize the exponentially discounted sum of future rewards.
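The core of standard PPO is the clipped surrogate objective; here is a textbook sketch (the paper uses an extended variant, so take this as the baseline idea only):

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantage, clip_eps=0.2):
    """Clipped surrogate loss: discourage the new policy from moving
    too far from the policy that collected the data."""
    ratio = torch.exp(new_logp - old_logp)  # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```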

When training the policy, OpenAI Five uses no human game data; it learns through self-play. Similar training has also been applied to Go and chess.

In self-play, 80% of opponents are clones with the latest parameters, and 20% are clones with older parameters. Every 10 iterations, the newly trained clone is added to the pool of past opponents. When the training agent defeats a past opponent, the system adjusts how often that opponent is sampled, using a learning rate.
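My reading of that scheme, as a toy sketch (the sampling weights and the update rule here are illustrative assumptions, not the paper's exact formula):

```python
import random

class OpponentPool:
    """Sample the latest clone 80% of the time, past clones 20%;
    opponents that keep losing get sampled less often."""
    def __init__(self, lr=0.01):
        self.quality = {}  # snapshot id -> sampling weight
        self.lr = lr

    def add_snapshot(self, snap_id):
        self.quality[snap_id] = max(self.quality.values(), default=1.0)

    def pick(self, latest_id):
        if not self.quality or random.random() < 0.8:
            return latest_id                   # 80%: newest parameters
        ids = list(self.quality)
        return random.choices(ids, weights=[self.quality[i] for i in ids])[0]

    def report_win_against(self, snap_id):
        if snap_id in self.quality:            # beaten opponents fade out
            self.quality[snap_id] *= (1 - self.lr)
```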

According to an earlier statement by OpenAI's CTO, before defeating OG, OpenAI Five had practiced the equivalent of 45,000 years of Dota2; each day of training corresponds to roughly 180 years of human play.

Key point 3: Compute and hyperparameters

Training such a complex AI system will definitely consume a lot of resources.

OpenAI estimated the GPU compute used for optimization. Its conclusion: OpenAI Five's optimization consumed roughly 770±50 to 820±50 petaflop/s-days of GPU compute. The newer, stronger Rerun, trained over the subsequent two months, consumed about 150±5 petaflop/s-days.

Again, the figures OpenAI announced cover only the compute used for optimization, which is only part of the total training overhead, accounting for about 30% of it.

Earlier, OpenAI also revealed that a day of OpenAI Five training requires 256 P100 GPUs and 128,000 CPU cores.

As for the network's hyperparameters, the paper says that in training Rerun, OpenAI further simplified them based on experience, ultimately varying only four key hyperparameters:

• Learning rate
• Entropy penalty coefficient
• Team spirit
• GAE time horizon (see the sketch below)
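Of the four, the GAE time horizon is the least self-explanatory. Below is a standard textbook GAE computation for reference; the effective horizon is governed by the discount γ, roughly 1/(1−γ):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one finite trajectory.
    `values` holds the value estimate for each state in the trajectory."""
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```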

Of course, OpenAI also said that there is room for further optimization of these hyperparameters.

Key point 4: Not everything is self-taught

Finally, there is one more point to emphasize.

OpenAI states clearly in the paper that in learning Dota2, the system does not rely entirely on reinforcement learning and self-play; it also draws on some human knowledge. This differs from AlphaGo Zero.

Some game mechanics are handled by scripted logic: for example, the order in which heroes buy items and learn skills, and control of the courier. OpenAI says these scripts exist partly for historical reasons and partly for reasons of cost and time, but notes that such behaviors could also be learned through self-play. A toy illustration follows.
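To make "scripted" concrete: this sort of behavior is hand-coded rather than learned. A sketch of a fixed item build order, where the item names and function are placeholders of my own, not the bot's actual script:

```python
# Hand-written game logic, as opposed to a learned policy.
BUILD_ORDER = ["healing_salve", "boots_of_speed", "magic_wand"]

def next_purchase(inventory, gold, prices):
    """Buy the first affordable item in a fixed, scripted order."""
    for item in BUILD_ORDER:
        if item not in inventory and gold >= prices.get(item, float("inf")):
            return item
    return None
```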

Battle Review

Finally, let's review the two games in which OpenAI Five beat OG.

First round

AI (Radiant): Sniper, Gyrocopter, Crystal Maiden, Death Prophet, Sven

Humans (Dire): Earthshaker, Witch Doctor, Viper, Riki, Shadow Fiend

After the draft, OpenAI Five put its own win probability at 67.6%.

Right after the start, OpenAI Five drew first blood, and the human team answered quickly by killing the AI's Crystal Maiden. From then on, the two sides stayed roughly even in kills. The AI held an overall economic lead throughout, but the richest hero on the map was always the humans' carry, Shadow Fiend.

This exposed a clear difference in strategy: OG played a traditional human style with 3 cores and 2 supports, while the AI spread resources relatively evenly across its 5 heroes, everyone eating from the same big pot.

After several intense pushes and team fights, around the 19-minute mark the AI's estimate of its own win probability passed 90%, and the confident AI stormed the humans' high ground in one go.

OG then chose to split up and push separate lanes. Several commentators speculated this was meant to scatter the AI and keep it from pushing as a group, but it did not work for long.

The humans held out until the 38-minute mark; Earthshaker had just bought back when the AI's final all-out push took down the human base.

OpenAI Five took the first game, to applause from the live audience.

In this game the AI showed a clear pattern: it bought two healing salves at the start, and its follow-up purchases leaned toward consumables rather than attribute-boosting items.

That, along with the even resource-sharing mentioned earlier and its frequent early-game buybacks, is very different from the habits of human professionals.

Second round

AI (Radiant): Crystal Maiden, Gyrocopter, Sven, Witch Doctor, Viper

Humans (Dire): Sniper, Earthshaker, Death Prophet, Slark, Lion

After the draft, the AI put its win probability at 60.8%, slightly lower than with the previous game's lineup.

For the first two minutes, both sides laned out peacefully; then, unexpectedly, the human mid laner Topson gave away first blood.

From then on, the human side fell apart at an alarming rate.

At 5 minutes, the AI's confidence had jumped, predicting an 80% win rate; at 7 minutes it had destroyed a lane tower; at 10 minutes it was 4,000 gold ahead of the humans, had pushed down two more towers, and gave itself a 95% win rate.

By 11 minutes, the AI had reached OG's high ground.

In just 21 minutes, OG's base fell and OpenAI Five took the second game with ease. By the end, OG's kill count was still in single digits, losing the kill score 46:6.

Easy as the win was, the games still showed rough edges in the AI's play. It is largely helpless against humans juking through the dense trees; in this game, Ceb saved his own life by weaving through the woods.
