This article is from the WeChat public account Brain Polar Body (ID: unity007), author: Orca.

Predicting the unknown has always been a long-sought ability of human beings. Think of the Zhouyi and its eight trigrams familiar to Chinese readers, the "Tui Bei Tu" written by Taoists in the Tang Dynasty, the astrology familiar to Westerners, and the tarot cards that became popular in the Middle Ages. More recently, the national enthusiasm and commercial carnival stirred up by the Mayan "2012 doomsday" prophecy is still fresh in our memory.

Now the age of "asking ghosts and gods rather than the people" has passed, and we are familiar with deterministic, empirical, and even probabilistic predictions of the physical world and of social and economic life. But what about the highly complex, multivariate, data-heavy predictions evoked by the "butterfly effect" - are humans helpless there?

The answer is no.

Recently, the outbreak of a novel coronavirus in Wuhan, China has drawn close attention from the World Health Organization and health institutions around the world. Among the coverage, Wired magazine reported that "a Canadian company, BlueDot, was the first to predict and announce the Wuhan epidemic through an AI monitoring platform", which attracted widespread attention from domestic media. This seems to be the outcome we most want from "predicting the future": with a foundation of accumulated big data and AI inference, human beings appear able to work out "God's will", reveal the cause-and-effect laws hidden in chaos, and try to save the world before disaster strikes.

Today, we will start with the prediction of infectious diseases and see how AI is moving, step by step, toward becoming a "prophecy machine".

Google GFT cries "wolf": a rhapsody of flu big data


Using AI to predict infectious diseases is obviously not BlueDot's exclusive patent. In fact, as early as 2008, today's AI "powerhouse" Google had already made a less successful attempt.

In 2008, Google launched a flu prediction system, Google Flu Trends (hereinafter GFT). GFT made its name a few weeks before the H1N1 outbreak in the United States in 2009: Google engineers published a paper in the journal Nature showing that the massive search data accumulated by Google could successfully predict the spread of H1N1 across the United States. To analyze influenza trends and regions, Google used billions of search records and processed 450 million different mathematical models to construct an influenza prediction index. The index's correlation with official data from the US Centers for Disease Control and Prevention (CDC) was as high as 97%, yet it ran about two weeks ahead of the CDC.

In the face of an epidemic, time is life and speed is wealth. If GFT could maintain this "predictive" ability, it would clearly buy the whole society a head start in controlling an epidemic.

However, the myth of prophecy did not last long. In 2014, GFT drew media attention again, but this time for its poor performance. Researchers published the article "The Parable of Google Flu: Traps in Big Data Analysis" in Science, pointing out that in 2009 GFT had failed to predict the non-seasonal influenza A (H1N1), and that in 100 of the 108 weeks from August 2011 to August 2013, GFT overestimated the influenza incidence reported by the CDC. By how much? In the 2011-2012 season, GFT's predicted incidence was more than 1.5 times the CDC's reported value; by the 2012-2013 season, it was more than twice the CDC's reported value.

(Chart from "The Parable of Google Flu: Traps in Big Data Analysis", Science, 2014)

Although GFT adjusted its algorithm in 2013 and responded that the main culprit behind the deviation was the heavy media coverage of GFT itself, which changed people's search behavior, GFT's forecast of influenza incidence for the 2013-2014 season was still 1.3 times the CDC's reported value. The systematic errors previously identified by researchers persisted; in other words, GFT kept crying "wolf".

What factors did GFT miss that left this prediction system in such a dilemma?

According to the researchers' analysis, GFT's big data analysis produced such large systematic errors because its data collection characteristics and evaluation methods had the following problems:

I. Big data hubris

The so-called "big data hubris" is the premise adopted by Google's engineers: that the big data obtained from users' search keywords fully covers the collection of influenza cases and can completely replace traditional data collection (sampling statistics) rather than merely supplement it. In other words, GFT assumed that the "collected user search data" corresponded fully to the "people affected by an influenza epidemic".

This "hubris" ignores the fact that a huge volume of data does not by itself guarantee comprehensiveness or accuracy, so the database samples that worked in 2009 could not cover the new data features of subsequent years. It is also because of this "hubris" that GFT apparently never considered introducing professional health and medical data or expert experience, and never "cleaned" or "denoised" the user search data, leading to overestimates of epidemic incidence that it could not correct.

II. Search engine evolution

At the same time, the search engine itself did not stand still. After 2011, Google introduced "recommended related search terms", the related-search pattern we are all familiar with today.

For example, a flu-related search would return a list of related flu treatments, and after 2012 related diagnostic terms were added as well. The researchers concluded that these adjustments may have artificially driven up some searches and caused Google to overestimate flu prevalence.

For example, when a user searches for "sore throat", Google recommends keywords such as "sore throat and fever" and "how to treat sore throat". The user may click on these out of curiosity or other reasons, producing search keywords that do not reflect the user's actual intent and undermining the accuracy of the data GFT collects.

Users' search behavior in turn affects GFT's predictions. For example, media reports on a flu epidemic increase searches for flu-related terms, which then feed back into GFT's forecast. Just as Heisenberg's "uncertainty principle" in quantum mechanics states that "measurement is interference", in the bustling world of search engines, awash with media reports and users' subjective information, there is an analogous "prediction is interference" paradox. Search engine users' behavior is not entirely spontaneous: media reports, social media hotspots, search engine recommendations, and even big data recommendations all influence users' minds, producing concentrated bursts of particular search terms.

Why were GFT's forecasts always too high? By this logic, once the epidemic index published by GFT rose, it immediately triggered media reports, which led to more related searches and in turn reinforced GFT's judgment of the epidemic. No matter how the algorithm was adjusted, it could not escape this self-amplifying result.
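
To make this feedback loop concrete, here is a minimal, purely illustrative sketch (in Python, with invented numbers, not GFT's actual model) of how a published index that stimulates extra searches ends up amplifying itself:

```python
# A toy simulation of the feedback loop described above: the published index
# drives media coverage, which drives extra searches, which inflates the next
# index. All numbers are illustrative assumptions.

true_incidence = [10, 12, 15, 20, 18, 14, 11]   # hypothetical "real" flu cases per week
media_gain = 0.5                                 # assumed extra searches per point of published index

published_index = []
extra_searches = 0.0
for cases in true_incidence:
    organic_searches = cases * 1.0               # assume searches track real cases 1:1
    observed = organic_searches + extra_searches # index is built from all observed searches
    published_index.append(observed)
    extra_searches = media_gain * observed       # coverage of the published index spurs more searches

for week, (cases, idx) in enumerate(zip(true_incidence, published_index)):
    print(f"week {week}: true={cases:5.1f}  published index={idx:6.1f}  overestimate={idx / cases:4.2f}x")
```

Even in this crude toy, the overestimation ratio creeps upward week after week, which is the qualitative pattern the researchers observed in GFT.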

III. Correlation rather than causation

Researchers pointed out that GFT's root problem was that Google's engineers never clarified the causal relationship between search keywords and the spread of influenza; they focused only on the statistical correlations in the data. Overvaluing "correlation" while neglecting "causation" leads to inaccurate predictions.

Take "flu" as an example: if the search volume for the term skyrockets for a period, it may simply be because a "flu"-themed movie or song has been released; it does not necessarily mean that a flu outbreak is really underway.

Although the outside world long hoped that Google would make the GFT algorithm public, Google chose not to, leading many researchers to question whether the results could be reproduced and whether there were larger commercial considerations at play. These researchers argued that combining search big data with traditional statistical data ("small data") could yield deeper and more accurate studies of human behavior.

Obviously, Google did not take this advice seriously. GFT officially went offline in 2015, though it continues to collect related user search data and provides it only to the US Centers for Disease Control and Prevention and to research institutions.

Why BlueDot was the first to predict successfully: a concerto of artificial intelligence and human analysis


As we all know, Google was already laying out its artificial intelligence strategy at the time; it acquired DeepMind in 2014 while keeping it operationally independent. But Google did not invest further attention in GFT, never considered adding AI to the GFT algorithm model, and instead chose "euthanasia" for GFT.

Almost at the same time, the BlueDot we see today was born.

BlueDot is an automated epidemic surveillance system established by infectious disease expert Kamran Khan. Every day it analyzes roughly 100,000 articles in 65 languages to track outbreaks of more than 100 infectious diseases, attempting to use this targeted data collection to find clues to the outbreak and spread of potential epidemics.

BlueDot uses natural language processing (NLP) and machine learning (ML) to train this "automated disease monitoring platform", allowing it to identify and discard irrelevant "noise" in the data. For example, the system can recognize whether a mention of "anthrax" refers to an actual anthrax outbreak in Mongolia or merely to a reunion of the heavy metal band Anthrax, founded in 1981. GFT, by contrast, simply treated anyone searching for "flu"-related terms as a possible flu patient; with so many unrelated users included, the estimated incidence was inflated. This is BlueDot's advantage over GFT in screening key data.
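
As a rough illustration of this kind of noise filtering (not BlueDot's actual pipeline), the toy sketch below uses simple context keywords to decide whether an "anthrax" mention looks like a disease report or a music story; the keyword lists and headlines are invented:

```python
# A toy disambiguation filter: score each headline by how many disease-context
# versus music-context words appear around the ambiguous term.

DISEASE_CONTEXT = {"outbreak", "infection", "livestock", "cases", "quarantine", "spores"}
MUSIC_CONTEXT = {"band", "album", "tour", "concert", "reunion", "metal"}

def is_disease_mention(text: str) -> bool:
    words = set(text.lower().replace(",", " ").split())
    disease_score = len(words & DISEASE_CONTEXT)
    music_score = len(words & MUSIC_CONTEXT)
    return disease_score > music_score

headlines = [
    "Anthrax outbreak reported in livestock, quarantine imposed",   # hypothetical
    "Anthrax announces reunion tour with fellow metal band",        # hypothetical
]
for h in headlines:
    label = "disease signal" if is_disease_mention(h) else "noise (music)"
    print(f"{label}: {h}")
```

A production system would of course rely on trained NLP models rather than hand-written keyword lists, but the goal is the same: keep the epidemiological signal and drop the noise.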

In this forecast of the new coronavirus epidemic, Khan said, BlueDot searched foreign-language news reports, animal and plant disease networks, and official announcements for sources of epidemic information. But the platform's algorithm does not use social media posts, because that data is too cluttered and too "noisy".

For predicting the transmission path after an outbreak, BlueDot prefers to use global airline ticketing data, which better reveals the movements and timing of infected travellers. In early January, BlueDot correctly predicted that the new coronavirus would spread from Wuhan to Beijing, Bangkok, Seoul, and Taipei within days.
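
As a hedged illustration of how ticketing data can be turned into a spread forecast (this is not BlueDot's disclosed method), the sketch below ranks destination cities by expected imported cases, using invented passenger volumes and an assumed infection rate:

```python
# Toy ranking of likely destination cities by expected imported cases.
# Passenger volumes and infection rate are invented purely for illustration.

monthly_passengers_from_wuhan = {   # assumed ticketing data
    "Beijing": 120_000,
    "Bangkok": 60_000,
    "Seoul": 45_000,
    "Taipei": 30_000,
    "Reykjavik": 200,
}

infection_rate = 0.0001  # assumed fraction of travellers who are infected

risk_ranking = sorted(
    ((city, volume * infection_rate) for city, volume in monthly_passengers_from_wuhan.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for city, expected_cases in risk_ranking:
    print(f"{city}: expected imported cases ~ {expected_cases:.1f}")
```

The intuition is simple: the more travellers flow from the outbreak city to a destination, the earlier and more likely that destination is to report imported cases.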

The new coronavirus outbreak is not BlueDot's first success. In 2016, by building an AI model of the Zika virus's transmission path in Brazil, BlueDot successfully predicted the appearance of Zika in Florida, USA, six months in advance. This means BlueDot's AI surveillance can even predict the geographical spread of an epidemic.

From GFT's failure to BlueDot's success, what are the key differences?

I. Forecasting technology differences

Previous mainstream predictive analysis relied on a series of data mining techniques, most often the "regression" methods of mathematical statistics: multiple linear regression, polynomial regression, multi-factor logistic regression, and so on. In essence these are curve-fitting methods - "conditional mean" predictions under different models. This is exactly the technical principle behind GFT's prediction algorithm.
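
For readers unfamiliar with this family of methods, here is a minimal sketch of the "conditional mean" idea behind regression-style forecasting, using made-up numbers (x as a search-volume index, y as reported incidence); it is illustrative only, not GFT's actual model:

```python
# Least-squares regression as a conditional-mean predictor on invented data.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])      # hypothetical search-volume index
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])     # hypothetical reported incidence

# Fit y ~ a*x + b; the fitted line is the estimated conditional mean E[y | x].
a, b = np.polyfit(x, y, deg=1)
print(f"fitted model: incidence ~ {a:.2f} * search_index + {b:.2f}")
print(f"prediction for search_index=7: {a * 7 + b:.2f}")
```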

Before machine learning, multiple regression analysis offered an effective way to handle many conditions: one could search for the model that minimized prediction error on the data and maximized "goodness of fit". But regression analysis's pursuit of an unbiased fit to historical data cannot guarantee the accuracy of future forecasts, which leads to the well-known problem of "overfitting".

According to Shen Yan, a professor at Peking University's National School of Development, in the article "Glory and Traps in Big Data Analysis: Starting from Google Flu Trends", Google GFT did indeed have an "overfitting" problem. In 2009, GFT could observe all the CDC data from 2007 to 2008, so the standard it used when training and testing the model was to fit the CDC data as closely as possible, at any cost.

Thus, the 2014 Science paper pointed out that when GFT fit the 2007-2008 influenza epidemic rate, some seemingly odd search terms were dropped and roughly 50 million search terms were used to fit only 1,152 data points. After 2009, the data GFT had to predict contained ever more unknown variables, including feedback from its own published predictions. No matter how GFT was adjusted, it still faced overfitting, making the overall systematic error unavoidable.
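
A small, self-contained demonstration of the overfitting trap described above (with invented, roughly linear data, not GFT's data): a high-degree polynomial fits the historical points almost exactly, yet its error on a held-out "next season" point is typically far larger than that of a simple line:

```python
# Comparing a simple and a high-degree polynomial fit on made-up data.

import numpy as np

x_train = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y_train = np.array([2.0, 4.1, 5.9, 8.2, 9.9, 12.1, 14.2, 15.8])  # roughly linear plus noise
x_test, y_test = 9.0, 18.0                                        # hypothetical "next season" point

for degree in (1, 7):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_sq_error = (np.polyval(coeffs, x_test) - y_test) ** 2
    print(f"degree {degree}: train MSE={train_mse:.3f}  held-out squared error={test_sq_error:.3f}")
```

The extra parameters drive the training error toward zero, but they mostly fit noise, so the extrapolation to the unseen point tends to be far worse than the plain line's.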

BlueDot adopted a different strategy: combining medical and public health expertise with artificial intelligence and big data analysis to track and predict the global distribution and spread of epidemic infectious diseases, and to give the best response plan.

BlueDot mainly uses natural language processing and machine learning to improve the effectiveness of its monitoring engine. With the improvement of computing power and machine learning in recent years, statistical prediction methods have fundamentally changed, chiefly through the application of deep learning (neural networks). Using "backpropagation", a network can continuously train on data, receive feedback, and learn, extracting "knowledge" from it; after systematic self-learning, the prediction model is continuously optimized and its accuracy improves with learning. This makes the historical data fed in before model training even more critical: sufficiently rich feature data is the foundation for training the prediction model, and cleaned high-quality data with properly labeled features are the top priorities for successful prediction.
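
As a minimal sketch of the training loop described here (not BlueDot's actual model), the code below trains a tiny one-hidden-layer network by backpropagation on synthetic data, mapping a hypothetical feature vector of news, search, and travel signals to a next-week incidence target:

```python
# A tiny neural network trained by backpropagation on synthetic data.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 3))                                     # hypothetical features: news, search, travel signals
y = (2 * X[:, 0] + X[:, 1] - 0.5 * X[:, 2]).reshape(-1, 1)   # synthetic "incidence" target

# Parameters of a 3 -> 8 -> 1 network
W1, b1 = rng.normal(0, 0.5, (3, 8)), np.zeros((1, 8))
W2, b2 = rng.normal(0, 0.5, (8, 1)), np.zeros((1, 1))
lr = 0.1

for epoch in range(500):
    # forward pass
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - y
    loss = np.mean(err ** 2)

    # backward pass: backpropagate the squared-error gradient
    grad_pred = 2 * err / len(X)
    grad_W2 = h.T @ grad_pred
    grad_b2 = grad_pred.sum(axis=0, keepdims=True)
    grad_h = grad_pred @ W2.T * (1 - h ** 2)
    grad_W1 = X.T @ grad_h
    grad_b1 = grad_h.sum(axis=0, keepdims=True)

    # gradient-descent update
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    W2 -= lr * grad_W2; b2 -= lr * grad_b2

print(f"final training loss: {loss:.4f}")
```

The point of the sketch is the loop itself: forward prediction, error feedback, gradient-based correction, repeated until the model's predictions improve.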

II. Prediction mode differences

Unlike GFT, which handed the entire prediction process over to a big data algorithm, BlueDot does not leave prediction entirely to its AI monitoring system: after the data is screened, it is handed over to human analysts. This is exactly the difference between GFT's pure big data analysis and BlueDot's "expert experience" forecasting model.

The big data analyzed by the AI comes from selected websites (medical and health disease news) and platforms (airline ticketing, etc.). The early-warning information produced by the AI must then be re-analyzed by epidemiologists to confirm whether it is reasonable, and to assess whether the epidemic information should be released to society as soon as possible.
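
A minimal sketch of this human-in-the-loop pattern, with hypothetical names, scores, and thresholds: the model only screens candidate alerts, and a stand-in for the epidemiologist's judgment decides what gets published:

```python
# Toy human-in-the-loop alert pipeline: automated screening, then expert review.

from dataclasses import dataclass

@dataclass
class Alert:
    region: str
    disease: str
    model_score: float   # hypothetical risk score from the monitoring model

def model_screen(alerts: list[Alert], threshold: float = 0.7) -> list[Alert]:
    """Step 1: automated screening keeps only high-scoring candidates."""
    return [a for a in alerts if a.model_score >= threshold]

def expert_review(alert: Alert) -> bool:
    """Step 2: placeholder for an epidemiologist's judgment (here just a stub)."""
    return alert.model_score >= 0.85  # stand-in for human confirmation

candidates = [
    Alert("Region A", "novel coronavirus", 0.92),
    Alert("Region B", "seasonal flu", 0.40),
    Alert("Region C", "dengue", 0.75),
]

for alert in model_screen(candidates):
    decision = "publish warning" if expert_review(alert) else "hold for more evidence"
    print(f"{alert.region} / {alert.disease}: {decision}")
```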

Of course, these cases do not yet show that BlueDot's epidemic predictions are a complete success. First, might there be biases in the AI training model - for example, in trying to avoid under-reporting, might it exaggerate the severity of an epidemic and recreate the "wolf is coming" problem? Second, is the data evaluated by the monitoring model valid - hence BlueDot's caution about social media data and its excessive "noise"?

Fortunately, as a professional health services platform, BlueDot pays more attention than GFT did to the accuracy of its monitoring results. After all, professional epidemiologists are the final publishers of these prediction reports, and the accuracy of the predictions directly affects the platform's reputation and commercial value. This also means BlueDot faces the test of balancing commercial profit with public responsibility and information openness.

AI prediction of epidemic outbreaks is only a prelude...


"Was the first Wuhan coronavirus warning issued by an artificial intelligence?" This media headline really surprised many people. In an age of global integration, an outbreak anywhere can spread to any corner of the globe within a short period. The speed of discovery and the efficiency of early warning and notification have become the key to preventing epidemics.

If AI can become a better epidemiological early-warning mechanism, it will be a valuable addition to the epidemic prevention systems of the World Health Organization (WHO) and national health departments.

This raises the question of how far these institutions can rely on the epidemic forecasts provided by AI. In the future, an epidemic AI prediction platform will also need to assess the risk level of an epidemic and the economic and political risks its spread may cause, to help relevant departments make more robust decisions. All of this still takes time. In establishing rapid-response epidemic prevention mechanisms, these organizations should also put such AI surveillance systems on the agenda.

It can be said that AI's successful early prediction of the outbreak is a bright spot in humanity's response to this global epidemic crisis. Hopefully, this AI-assisted prevention and control campaign is only the prelude to a protracted war, with more possibilities to come: AI-assisted identification of major infectious disease pathogens; AI early-warning mechanisms built on seasonal epidemic data from major infectious disease regions; AI-assisted optimal deployment of medical supplies after an outbreak. Let us wait and see.

This article comes from the WeChat public account Brain Polar Body (ID: unity007), author: Orca.