This article is from the public WeChat account Neural Reality (ID: neuroreality); author: PAVLUS; header image from Vision China.

In the autumn of 2017, Sam Bowman, a computational linguist at New York University, felt that computers were still not very good at understanding text. Sure, in certain narrow domains they could simulate textual understanding reasonably well, such as machine translation and sentiment analysis (for example, deciding whether a sentence sounds friendly or hostile). But Bowman wanted measurable evidence of genuine machine comprehension: real, human-style English reading comprehension. So he designed a test that has since become well known.

In a paper published in April 2018, Bowman and collaborators from the University of Washington and DeepMind, Google’s artificial intelligence company, proposed GLUE (General Language Understanding Evaluation), a battery of nine reading comprehension tasks.

The tasks were chosen as “a fairly representative sample of the hard challenges recognized by researchers,” Bowman said, even though “they are not hard for humans.” One task asks whether a sentence is true given the sentence that precedes it: if you can infer from “President Trump arrived in Iraq to begin a one-week state visit” that Trump is on an overseas visit, you pass that item.

The machines’ results were dismal. Even the most advanced neural networks scored no higher than 69 out of 100 across the tasks, a D+ in letter-grade terms. Bowman and his collaborators were not surprised. Neural networks build layers of computational connections that loosely mimic how neurons connect in the mammalian brain, and although they had shown great promise in natural language processing (NLP), researchers did not believe these systems were learning anything substantial about the nature of language itself. GLUE seemed to confirm it. “These early results indicate that existing training models and methods cannot solve GLUE,” Bowman and his collaborators wrote in the paper.

But their assessment did not stand for long. In October 2018, Google introduced a new model nicknamed BERT (Bidirectional Encoder Representations from Transformers), which scored 80.5 on GLUE. On a brand-new benchmark designed to measure machines’ real understanding of natural language, or to expose their lack of it, the machines had jumped from a D+ to a B- in just six months.

“That definitely came as a shock,” Bowman recalls. BERT scored close to human level on several of the tasks, which the field found hard to believe; before BERT appeared, GLUE did not even have human baseline scores to compare against. When Bowman and one of his doctoral students added human baselines in February 2019, a BERT-based system built by Microsoft surpassed them within a few months.

— Simon Prades

As of this writing, nearly every system at the top of the GLUE leaderboard builds on BERT, and several of them outscore humans.

But does this mean machines are really beginning to understand our language, or have they just gotten better at gaming our tests? As BERT-based neural networks took benchmarks like GLUE by storm, new evaluation methods emerged that portray these powerful NLP systems as computational versions of “Clever Hans,” the early-twentieth-century horse that seemed smart enough to do arithmetic but was actually following unintentional cues from its trainer.

In the famous “Chinese room” thought experiment, a person who knows no Chinese sits in a room stocked with many Chinese grammar books. These books spell out exactly how to respond to any sequence of Chinese characters received. When someone outside slips a note written in Chinese under the door, the person inside consults the grammar books and sends back a perfectly fluent reply in Chinese.

To the person outside, the replies may look like proof of fluency, but the thought experiment holds that the person inside cannot be said to understand Chinese. Still, even merely simulating human understanding is no small feat in a field as difficult as natural language processing.

The only problem is that perfect grammar books like these do not exist. Natural language is far too complex and arbitrary to be reduced to a strict set of rules. Take syntax, the rules and rules of thumb that govern how words combine into meaningful sentences. The famous linguist Noam Chomsky offered the sentence “Colorless green ideas sleep furiously” as an example: it is syntactically well formed, yet any speaker knows it is meaningless nonsense. Natural language also obeys countless rules that can only be called unwritten. What pre-written grammar book could possibly cover them all?

NLP researchers have tried to make the impossible possible: they use a method called “pre-training” to let neural networks generate their own makeshift “grammar books.”

Before 2018, one of the main pre-training tools for NLP models was something like a dictionary. Known as word embeddings, this tool encodes the associations between words as numbers and feeds them to a deep neural network as input, which is roughly like giving the person in the “Chinese room” a very crude vocabulary list. But a neural network pre-trained only with word embeddings still cannot grasp the meaning of words at the sentence level: “it would think that ‘a man bit the dog’ and ‘a dog bit the man’ express exactly the same thing,” said Tal Linzen, a computational linguist at Johns Hopkins University.
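
To make the “dictionary-like” nature of word embeddings concrete, here is a minimal sketch with made-up vectors (real embeddings such as word2vec or GloVe are learned from large corpora and have hundreds of dimensions). It illustrates Linzen’s point: a model that only averages word vectors cannot tell “a man bit the dog” from “a dog bit the man.”

```python
import numpy as np

# Toy word embeddings: each word maps to a fixed vector (values invented here).
embeddings = {
    "man": np.array([0.2, 0.7, 0.1]),
    "dog": np.array([0.3, 0.6, 0.2]),
    "bit": np.array([0.9, 0.1, 0.4]),
    "a":   np.array([0.1, 0.1, 0.1]),
    "the": np.array([0.1, 0.2, 0.1]),
}

def bag_of_vectors(sentence):
    """Average the word vectors, discarding word order entirely."""
    return np.mean([embeddings[w] for w in sentence.lower().split()], axis=0)

v1 = bag_of_vectors("a man bit the dog")
v2 = bag_of_vectors("a dog bit the man")
print(np.allclose(v1, v2))  # True: both sentences look identical to this model
```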

Tal Linzen—Source: Will Kirk/Johns Hopkins University

A better approach would be to use pre-training that covers vocabulary, syntax, and context, giving the neural network a much richer “grammar book,” and only then train it on a specific NLP task. In early 2018, researchers at OpenAI, the University of San Francisco, the Allen Institute for Artificial Intelligence, and the University of Washington all hit upon a clever way to approximate this ambitious goal. Rather than pre-training only the network’s first layer with word embeddings, they began training entire neural networks on a more fundamental task called language modeling.

“The simplest kind of language model reads some words and then tries to predict the next word. If I say ‘George Bush was born,’ the model has to predict the next word in that sentence,” explained Myle Ott, a research scientist at Facebook.

Deep pre-trained language models like these are, in fact, relatively straightforward to construct. Researchers simply copy large amounts of text from open sources such as Wikipedia, feed the network grammatically correct sentences totaling hundreds of millions of words, and let it derive its own predictions of the next word. In effect, it is like asking the person in the “Chinese room” to write his own grammar rules using only the Chinese notes passed in to him.
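
As a rough illustration of the “predict the next word” objective, here is a counting-based stand-in using only the Python standard library; it is not the neural approach described in the article, just a sketch of the prediction task itself.

```python
from collections import Counter, defaultdict

# A tiny "corpus"; real pre-training uses hundreds of millions of words.
corpus = ("george bush was born in connecticut . "
          "george bush was born in 1946 . "
          "george bush was governor of texas .")

# Count which word tends to follow which (a bigram language model).
next_word_counts = defaultdict(Counter)
tokens = corpus.split()
for current, following in zip(tokens, tokens[1:]):
    next_word_counts[current][following] += 1

def predict_next(word):
    """Guess the most frequent continuation seen so far."""
    if word not in next_word_counts:
        return None
    return next_word_counts[word].most_common(1)[0][0]

print(predict_next("was"))   # 'born' (seen twice, vs. 'governor' once)
print(predict_next("bush"))  # 'was'
```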

“The great thing about this kind of training,” Ott said, “is that the language model actually ends up learning a lot about syntax.”

A neural network pre-trained this way comes to other, more specific NLP tasks equipped with much richer representations; applying it to those tasks is called fine-tuning.

“From pre-training, you can take a language model and adapt it to whatever task you actually want to perform,” Ott explained. “You get much better results that way than by training a model directly on the end task.”
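
Here is a conceptual PyTorch sketch of what fine-tuning means. The `pretrained_encoder` is a hypothetical stand-in for any pre-trained language model that maps a sentence representation to a feature vector; none of the names or numbers come from the article.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained language model: in reality its weights come from
# the "predict the next word" pre-training phase; here it is just a dummy layer.
pretrained_encoder = nn.Linear(300, 768)

class FineTunedClassifier(nn.Module):
    """Pre-trained encoder plus a small new 'head' for a specific downstream task."""
    def __init__(self, encoder, hidden_size=768, num_labels=2):
        super().__init__()
        self.encoder = encoder                          # knowledge from pre-training
        self.head = nn.Linear(hidden_size, num_labels)  # freshly initialized layer

    def forward(self, sentence_features):
        return self.head(self.encoder(sentence_features))

model = FineTunedClassifier(pretrained_encoder)

# Fine-tuning: keep training *all* the weights on labeled task data,
# typically with a small learning rate so pre-trained knowledge is preserved.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
x, y = torch.randn(8, 300), torch.randint(0, 2, (8,))  # fake task data
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```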

In June 2018, OpenAI introduced a neural network called GPT, whose language model had been pre-trained on 11,038 e-books containing nearly a billion words. It went straight to the top of GLUE with a score of 72.8. Even so, Sam Bowman felt that the machines’ reading comprehension was still nowhere near human level.

Then BERT appeared.

Powerful “recipe”

So, what exactly is BERT?

First of all, BERT is not a fully trained neural network, nor did it surpass human performance right out of the box. Rather, in Bowman’s words, it is “a very precise pre-training scheme.” Google researchers developed BERT as a way of preparing a neural network to learn all kinds of NLP tasks, much as a pastry chef follows a recipe to produce a pre-baked pie crust that can then be turned into many different pies (blueberry or spinach, say). Google also open-sourced BERT’s code, so other researchers do not have to rebuild the “recipe” from scratch; they can simply download BERT ready-made, as easily as buying a pre-baked pie crust at the supermarket.
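
A minimal sketch of “buying the pre-baked pie crust.” It assumes the Hugging Face `transformers` library (a popular wrapper around the released BERT weights, not the original TensorFlow release Google published) and an internet connection to download the checkpoint; the model name and sentence are illustrative.

```python
from transformers import BertTokenizer, BertModel

# Download the pre-trained "pie crust": weights already pre-trained by Google.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Feed it a sentence; it returns one contextual vector per token,
# ready to be fine-tuned for whatever NLP task you have in mind.
inputs = tokenizer("A dog bit the man.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # roughly: torch.Size([1, 8, 768])
```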

If BERT is like a recipe, what raw materials does it need? “There is three reasons for the success of the entire model.” Facebook research scientist Omar Levi . He is committed to studying the reasons for the success of BERT.

Omer Levy—Source: Omer Levy

First, a pre-trained language model, the “grammar book” of the Chinese room. Second, the ability to figure out which features of a sentence matter most.

In 2017, Jakob Uszkoreit, an engineer at Google Brain, was working on ways to strengthen Google’s competitiveness in machine language understanding. He noticed that even the most advanced neural networks were constrained by a built-in limitation: they read the words of a sentence one by one, from left to right. On the surface, this “sequential reading” seems to mimic how humans read, but Uszkoreit suspected that “understanding language in a linear, sequential fashion may not be optimal.”

So Uszkoreit and his collaborators designed a new neural network architecture. At its core is a mechanism for allocating “attention,” which lets every layer of the network assign extra weight to particular features of its input. This attention-based architecture, called the Transformer, can encode each word of a sentence like “a dog bites that man” in several different ways at once. For example, it can link “bites” and “man” as verb and object while ignoring “a”; at the same time, it can link “bites” and “dog” as verb and subject while largely ignoring “that.”
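
The core “attention” computation can be sketched in a few lines of NumPy. This is only the bare scoring-and-mixing step; the actual Transformer adds learned projection matrices, multiple attention heads, and positional information, none of which appear here.

```python
import numpy as np

def attention(queries, keys, values):
    """Each word scores its relevance to every other word,
    then takes a weighted blend of their vectors."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])  # pairwise relevance
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax per word
    return weights @ values                              # context-mixed vectors

# Toy input: 4 "words" (say: a, dog, bites, man), each a 3-dimensional vector.
x = np.random.rand(4, 3)
contextualized = attention(x, x, x)  # self-attention: words attend to each other
print(contextualized.shape)          # (4, 3): one context-aware vector per word
```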

The Transformer’s non-sequential way of representing sentences is more expressive. Uszkoreit describes it as treelike: each layer of the network draws multiple parallel connections between words, and words that are not adjacent are often linked directly, much like the sentence diagrams a schoolchild draws when parsing sentence components. “It is effectively a lot of overlapping trees,” Uszkoreit explained.

These treelike representations let the Transformer model context well and learn the connections between words that sit far apart in a sentence. “It seems a bit counterintuitive,” Uszkoreit said, “but the model is grounded in linguistics, which has long studied treelike models of language.”

Finally, the third “ingredient” in the BERT recipe pushes this nonlinear reading one step further.

Unlike other pre-trained language models, which train a neural network by reading vast amounts of text from left to right, BERT reads from left and right at the same time and learns to predict words that have been hidden in the middle. Given the input “George Bush was … in Connecticut in 1946,” for example, it parses the text in both directions and predicts the masked word in the middle of the sentence: “born.” “This bidirectionality conditions the network to extract as much information as it can from any combination of words before making a prediction,” Uszkoreit said.

BERT’s fill-in-the-blank pre-training task is called masked language modeling. In fact, it has been used for decades to assess human language comprehension. For Google, it also served as a tool for training neural networks to read bidirectionally, replacing the unidirectional pre-training that had previously dominated the field. “Before BERT, unidirectional language modeling was the standard, even though it is an unnecessarily restrictive constraint,” said Kenton Lee, a research scientist at Google.
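
A hedged sketch of the fill-in-the-blank task as it can be tried today with the Hugging Face `transformers` library (assumed installed; the sentence is adapted from the article’s example, and the exact ranking of predictions may vary).

```python
from transformers import pipeline

# Load BERT together with its masked-language-modeling head.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT reads both sides of the blank before guessing the hidden word.
for guess in fill_mask("George Bush was [MASK] in Connecticut in 1946."):
    print(guess["token_str"], round(guess["score"], 3))
# Typically "born" ranks at or near the top.
```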

Each of these three ingredients (deep pre-trained language models, attention, and bidirectional reading) existed before BERT. But no one had thought to combine them until Google released BERT in late 2018.

Jakob Uszkoreit—Image Source: Google

Improved “recipe”

A good recipe attracts imitators, and various “chefs” soon began adapting BERT to their own style. In the spring of 2019, Bowman recalls, Microsoft and Alibaba kept tweaking their BERT-based models, overtaking each other and trading the top spot on the GLUE leaderboard. In August 2019, an upgraded version of BERT called RoBERTa took the stage, a debut noted at the time by DeepMind researcher Sebastian Ruder.

The BERT “pie crust” embodies a series of structural design decisions, each of which affects how well the model performs: the size of the neural network being “baked,” the amount of pre-training data, how that data is masked, and how long the network trains on it. Later “recipes,” including RoBERTa, come from researchers tweaking these decisions, the way chefs keep refining a dish.

Take RoBERTa: researchers at Facebook and the University of Washington added several ingredients (more pre-training data, longer input sequences, more training time), removed one (a “next sentence prediction” task that was part of the original BERT but turned out to hurt performance), and modified another (they made the masked-language pre-training task harder). The result? First place on GLUE. Six weeks later, researchers from Microsoft and the University of Maryland added their own tweaks to RoBERTa and eked out a new win. As of this writing, yet another model called ALBERT (short for “A Lite BERT”) has taken GLUE’s top spot by further adjusting the basic training design.
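
In practice, the “recipe decisions” described above amount to training hyperparameters. The sketch below is only illustrative: the values are approximations in the spirit of BERT-base and RoBERTa’s reported changes, not the exact published settings.

```python
# Illustrative pre-training "recipe"; values are approximate, for flavor only.
pretraining_recipe = {
    "network_size": {"layers": 12, "hidden_size": 768, "attention_heads": 12},
    "pretraining_corpus": "Wikipedia + books (hundreds of millions of words or more)",
    "masking_strategy": "randomly hide a fraction of tokens",  # the fill-in-the-blank task
    "next_sentence_prediction": False,  # RoBERTa dropped this auxiliary task
    "sequence_length": 512,             # longer inputs were one RoBERTa change
    "training_steps": 500_000,          # training longer was another
}
```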

“We are still trying to figure out which ‘recipes’ will work and which will not,” Ott said.

But perfecting the craft of pie-making will not necessarily teach you anything about chemistry, and endlessly optimizing BERT does not necessarily yield theoretical insight into natural language processing. “To be honest with you, I am not very interested in those papers; I find them extremely boring,” said Tal Linzen, the Johns Hopkins computational linguist. “There is a scientific puzzle here,” he grants, but it does not lie in figuring out how to make BERT and its kin smarter, nor even in why they became smarter. Instead, “we are trying to understand to what extent these models really understand human language,” he said, rather than “finding tricks that happen to work on the data sets we test on.”

In other words, BERT keeps racking up high scores on our tests, but what if it is getting the right answers in the wrong way?

Clever, but not intelligent

In July 2019, two researchers from National Cheng Kung University in Taiwan trained BERT on a fairly demanding natural language understanding benchmark: an argument reasoning comprehension task. BERT’s results were impressive.

The task requires choosing the implicit premise that makes a stated reason actually support a claim. For example, to argue from the reason “scientific research shows a link between smoking and cancer” to the claim “smoking causes cancer,” you need to assume that “scientific research is credible,” not that “scientific research is expensive.” The latter may well be true on its own, but it is beside the point in this argument. Machines, did you get all that?

If they didn’t, it doesn’t matter much, because even we humans struggle with this task: the average baseline score for an untrained person is only 80 out of 100.

BERT scored 77, a result the paper’s authors, with some understatement, called “surprising.”

But they did not conclude that BERT training had given the neural network Aristotelian powers of logical reasoning. Instead, they offered a simpler explanation: BERT was merely picking up superficial patterns in how the implicit premises were worded.

And so it was. After reanalyzing the training data set, the researchers found evidence of spurious cues. For example, simply choosing the option containing the word “not” gave the right answer on 61% of the questions. Once these cues were scrubbed from the data, BERT’s score dropped from 77 to 53, roughly equivalent to random guessing. Likewise, in a paper titled “Right for the Wrong Reasons” (echoing concerns raised in the machine-learning magazine The Gradient), Linzen and his co-authors published evidence that BERT’s excellent performance on certain GLUE tasks might also be due to spurious cues in those data sets. Their paper introduced a new data set designed to expose exactly the kind of shortcuts BERT appears to take on GLUE, called HANS (Heuristic Analysis for Natural-Language-Inference Systems).
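
To see how a “spurious cue” can inflate a score without any understanding, here is a toy illustration. The examples and numbers are invented; they only mimic the kind of statistical giveaway the researchers describe, such as correct answers that tend to contain the word “not.”

```python
# Invented two-choice examples; index 0 marks the correct option in each pair.
examples = [
    ("Smoking is not harmless.",         "Smoking is harmless.",          0),
    ("The link is not a coincidence.",   "The link is a coincidence.",    0),
    ("The studies were not fabricated.", "The studies were fabricated.",  0),
    ("Scientific research is credible.", "Scientific research is cheap.", 0),
]

def pick_by_cue(option_a, option_b):
    """A 'Clever Hans' rule: choose whichever option contains the word 'not'."""
    return 0 if " not " in f" {option_a} " else 1

correct = sum(pick_by_cue(a, b) == answer for a, b, answer in examples)
print(f"{correct}/{len(examples)} correct with no reasoning at all")  # 3/4 here
```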

So are BERT and its “siblings” nothing but a scam?

Bowman and Linzen both think that some of GLUE’s training data is messy, riddled with subtle biases introduced by the people who created it, and that a powerful BERT-based network can exploit all of these biases. “There is no single trick that will let a network solve every (GLUE) task, but there are plenty of shortcuts that boost its performance,” Bowman said, “and BERT can find them.” At the same time, he does not think BERT is all show and no substance. “These models do seem to have learned something real about language,” he said. “But they certainly are not understanding English in a comprehensive, robust way.”

Yejin Choi, a computer scientist at the University of Washington and the Allen Institute, believes that if progress toward genuine language understanding is to continue, researchers cannot focus only on building ever more powerful BERTs; they also need to design better benchmarks and better training data that leave machines less room to take shortcuts. Among other things, she studies an approach called adversarial filtering, which uses algorithms to scan data sets and remove examples containing giveaway patterns. But as the computational linguist Anna Rogers has noted, no matter how carefully a data set is designed and filtered, it can never cover all the edge cases and unpredictable inputs that we humans handle effortlessly when we use natural language.

Bowman points out that it is hard to imagine how a neural network could ever convince us that it truly understands language. Standardized tests, after all, are supposed to reveal something generalizable about the nature of the test-taker’s knowledge. But as anyone who has taken the SAT knows, such tests can be gamed with test-taking techniques. “It is very hard to design tests that are hard enough, and trick-proof enough, that solving them really convinces us we have fully mastered some aspect of artificial intelligence or language technology,” Bowman said.

Bowman and his collaborators recently introduced a new test called SuperGLUE, designed to be especially difficult for BERT-style neural networks. So far, no neural network has exceeded human performance on it. But if and when one does, will it mean that machines really understand language better? Or will it just mean that science has gotten better at teaching machines to beat our tests?

“There is a good analogy here,” Bowman said. “Knowing how to ace the LSAT or the MCAT does not necessarily make you fit to be a lawyer or a doctor.” And that, he added, seems to be the road artificial intelligence research keeps traveling. “Before anyone figured out how to write a chess program, everyone assumed that chess was a real test of intelligence,” he said. “The goal of our era, surely, is to keep posing harder problems that test machine language understanding, and then keep working out how to solve them.”

https://www.quantamagazine.org/machines-beat-humans-on-a-reading-test-but-do-they-understand-20191017/
