This article comes from the WeChat public account Heart of the Machine (Synced, ID: almosthuman2014). Author: Almost Human. Header image: IC photo.

At the World Artificial Intelligence Conference at the end of August last year, Shen Xiangyang (Harry Shum), then Microsoft's global executive vice president, officially announced Suphx, the Mahjong AI developed by Microsoft Research Asia. Recently, the full technical details of Suphx were officially released.

After Go, Texas Hold'em, Dota, and StarCraft, Suphx from Microsoft Research Asia has delivered another major AI breakthrough in games, this time in Mahjong.

Mahjong has long been regarded as a very challenging domain for AI research because of its complicated rules of play and scoring and its rich hidden information. Liu Tieyan, deputy dean of Microsoft Research Asia, once said: “It can be said that games like Dota are more ‘game’, while board and card games like Mahjong are more ‘AI’.”

Suphx represents the best result an AI system has achieved in Mahjong so far. It is also the first AI system to be promoted to 10 dan on Tenhou (“Tianfeng”), an internationally renowned professional Mahjong platform, and its strength surpasses that of 99.9% of the human players on the platform.

Not long ago, the Microsoft Mahjong AI research team published the Suphx paper on arXiv for the first time, making more of the technical details behind Suphx public.

Link to the paper: https://arxiv.org/abs/2003.13590

Method overview

In the paper, the researchers built Suphx (short for Super Phoenix), an AI system for 4-player Japanese Mahjong that uses a deep convolutional neural network as its model. They first train the network by supervised learning on the game logs of professional human players, then use the trained network as the policy and strengthen it through self-play reinforcement learning (RL). Specifically, the researchers use the popular policy gradient algorithm for self-play reinforcement learning and propose three techniques, global reward prediction, oracle guiding, and pMCPA, to address some of the known challenges:

  • Global reward prediction is used to train a predictor based on the current and previous rounds of information