When you touch AI, AI is also remembering you

Editor’s note: This article is from the WeChat public account “Brain Polar Body” (ID: unity007), by the author Tibetan Fox.

Deep learning has been woven into traditional industries in application, and AI has seen an unprecedented boom. But as Stanford University professor Fei-Fei Li has said, deep learning still has a long way to go in terms of intelligence, whether human or machine.

There is no end to learning, yet for a long time there has been no significant breakthrough on the algorithmic side. This leaves deployed models with some inherent shortcomings and has never stopped AI from being questioned. The privacy problems brought about by the proliferation of artificial intelligence, for example, clearly require not only the self-discipline of technology companies but also optimization and improvement of the algorithms themselves.

How does AI affect people’s privacy? A single article cannot fully answer such a complicated question, but we hope to start the discussion here.

When the neural network has memory

Before discussing privacy issues, let’s first talk about the good old LSTM model.

We have already introduced its role many times. In simple terms, it adds the concept of memory to the neural network, so that the model can remember information across long sequences and use it to make predictions. AI’s ability to write more fluent sentences, to hold smooth and natural multi-round dialogues with humans, and so on, is built on this capability.
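To make the idea concrete, here is a minimal sketch, assuming PyTorch, of a character-level LSTM; the state it passes from step to step is exactly the “memory” described above. The class name and hyperparameters are illustrative and not taken from any particular system.

```python
# A minimal sketch (assuming PyTorch): an LSTM keeps a hidden state ("memory")
# across a sequence and uses it to predict the next character.
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, seq_len) integer character ids
        x = self.embed(tokens)
        out, state = self.lstm(x, state)   # `state` carries the memory forward
        return self.head(out), state       # logits over the next character

# Usage: feed characters chunk by chunk and keep passing `state` along, so
# information seen earlier in the sequence can influence later predictions.
model = CharLSTM(vocab_size=100)
logits, state = model(torch.randint(0, 100, (1, 32)))
```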

Over time, scientists have supplemented and extended the memory of neural networks. The introduction of attention mechanisms, for example, lets an LSTM network track information accurately over long spans; external memory has likewise been used to enhance sequence-generation models and improve the performance of convolutional networks.

In general, improved memory gives the neural network the ability to perform complex reasoning about relationships, which significantly raises its intelligence; on the application side, the experience of intelligent systems for writing, translation, customer service and the like has also been greatly upgraded. To some extent, memory is where AI begins to tear off the label of “artificial stupidity”.


However, having memory also brings two problems. First, the neural network must learn to forget, freeing up storage space and retaining only the important information. For example, when a chapter of a novel ends, the model should reset the related information and keep only the corresponding conclusions.

In addition, the “subconscious” of the neural network also needs to be watched. Simply put, after being trained on sensitive user data, will a machine learning model unconsciously bring that sensitive information out once it is released to the public? In an era when everyone’s data can be collected, does this mean privacy risks are increasing?

Is AI really secretly remembering privacy?

On this question, Berkeley researchers have done a series of experiments, and the answer may shock many people: your data, AI may well have it in mind.

To understand the “unintentional memorization” of a neural network, we first need to introduce a concept: overfitting.

In deep learning, a model that performs well on the training data but fails to reach the same accuracy or error rate on data outside the training set is said to be overfitting. The main reasons for this gap between the laboratory and real samples are noise in the training data or too little data.
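To make the distinction concrete, here is a toy sketch (not from the Berkeley work, just NumPy) of overfitting in its classic form: a very flexible model fitted to a handful of noisy points reaches near-zero training error while its error on held-out data stays large.

```python
# Toy overfitting illustration: fit noisy training points almost perfectly,
# generalize poorly to held-out data. Numbers are arbitrary and illustrative.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)  # small, noisy dataset
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

coeffs = np.polyfit(x_train, y_train, deg=9)          # very flexible model
train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
print(f"train MSE ~ {train_err:.4f}, test MSE ~ {test_err:.4f}")  # near zero vs. large
```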

As a common side effect of deep neural network training, overfitting is a global phenomenon: it describes the state of the model over the entire data set. To test whether a neural network secretly “remembers” sensitive information in the training data, one instead has to look at local details, such as whether the model has a special attachment to a particular example (a credit card number, an account password, and so on).


To probe this “unintentional memorization”, the Berkeley researchers carried out an exploration in three stages:

First, prevent the model from overfitting. By performing gradient descent on the training data and minimizing the loss of the neural network, the final model is made to reach nearly 100% accuracy on the training data.

Then, give the machine the task of understanding the underlying structure of language. This is usually done by training a classifier over a sequence of words or characters, so that it predicts the next token after seeing the preceding context tokens.

Finally, the researchers conducted a control experiment. Into the standard Penn Treebank (PTB) data set they inserted a random number, “281265017”, to serve as a secret marker. They then trained a small language model on this expanded data set: given the preceding characters of the context, predict the next character.
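A rough sketch of what such a control experiment could look like, again assuming PyTorch. The corpus below is a tiny stand-in for Penn Treebank, and the model, hyperparameters and training loop are illustrative rather than the researchers’ actual setup.

```python
# Sketch: insert a fixed "canary" number into the training text, then train a
# small character-level model to predict the next character.
import torch
import torch.nn as nn

canary = "The random number is 281265017"
corpus = "ordinary training text with digits 0123456789 . " * 200 + canary  # canary inserted once

chars = sorted(set(corpus))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in corpus])

class TinyCharLM(nn.Module):
    def __init__(self, vocab, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, 32)
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)
    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.head(h)                      # logits over the next character

model = TinyCharLM(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

seq_len = 64
for step in range(200):                           # a short illustrative run
    i = torch.randint(0, len(data) - seq_len - 1, (1,)).item()
    x = data[i:i + seq_len].unsqueeze(0)          # context characters
    y = data[i + 1:i + seq_len + 1].unsqueeze(0)  # next-character targets
    logits = model(x)
    loss = loss_fn(logits.reshape(-1, len(chars)), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```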

In theory, the model is far smaller than the data set, so it cannot possibly remember all of the training data. So, can it still remember that string of characters?

The answer is YES.

The researchers fed the model a prefix, “The random number is 2812”, and the model happily and correctly predicted the entire remaining suffix: “65017”.

Even more surprising, when the prefix was shortened to “The random number is”, the model did not immediately output the string “281265017”. But the researchers computed the likelihood of every possible 9-digit suffix, and the results showed that the inserted marker string was more likely to be chosen by the model than any other suffix.
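Conceptually, that ranking comes from scoring candidate suffixes with the model’s own log-probabilities. The sketch below reuses the `model` and `stoi` vocabulary from the training sketch above and, to stay cheap, scores only a random sample of 9-digit suffixes rather than all 10^9 of them as in the actual study.

```python
# Rank candidate suffixes by the model's log-likelihood; reuses `model` and
# `stoi` from the preceding training sketch.
import random
import torch
import torch.nn.functional as F

def log_prob_of(model, stoi, text: str) -> float:
    """Sum of log P(next char | preceding context) over `text`."""
    ids = torch.tensor([[stoi[c] for c in text]])
    with torch.no_grad():
        logits = model(ids)                       # (1, len, vocab)
    logp = F.log_softmax(logits, dim=-1)
    # prediction at position t-1 scores the character at position t
    return sum(logp[0, t - 1, ids[0, t]].item() for t in range(1, ids.size(1)))

prefix = "The random number is "
canary_suffix = "281265017"
candidates = [f"{random.randrange(10**9):09d}" for _ in range(1000)] + [canary_suffix]

scores = {s: log_prob_of(model, stoi, prefix + s) for s in candidates}
rank = sorted(scores, key=scores.get, reverse=True).index(canary_suffix) + 1
print(f"canary rank among {len(candidates)} candidates: {rank}")
```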

At this point, a rough conclusion can cautiously be drawn: a deep neural network model does unconsciously remember the sensitive data fed to it during training.


When AI has a subconscious, should humans panic?

We know that AI today has become a cross-scenario, cross-industry social movement. From recommendation systems and medical diagnosis to dense city cameras, more and more user data is collected to feed algorithmic models, and that data may well contain sensitive information.

In the past, developers often anonymized the sensitive columns of a data set. But this does not mean the sensitive information in the data is absolutely safe, because an attacker with ulterior motives can still reverse out the original data by means of a table lookup.
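A toy example of such a table-lookup reversal, purely for illustration and not tied to any real data set: if a sensitive column is pseudonymized with a plain, unsalted hash, an attacker who can enumerate the original value space can simply precompute the mapping and read the data back.

```python
# Toy illustration of reversing a pseudonymized column via a lookup table.
import hashlib

def pseudonymize(phone: str) -> str:
    # Unsalted hash: deterministic, therefore reversible by enumeration
    return hashlib.sha256(phone.encode()).hexdigest()

# "Anonymized" value as it might appear in a published dataset
published = pseudonymize("13812345678")

# Attacker precomputes hashes over the candidate value space
# (here a tiny illustrative range of phone numbers)
lookup = {pseudonymize(f"138123456{i:02d}"): f"138123456{i:02d}" for i in range(100)}
print("recovered:", lookup.get(published))   # -> 13812345678
```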

Since sensitive data inevitably finds its way into models, measuring how much of its training data a model memorizes becomes a yardstick for the safety of future algorithmic models.

There are three doubts that need to be addressed here:

1. Is the “unintentional memory” of neural networks more dangerous than traditional overfitting?

Berkeley’s research found that “unintentional memorization” sets in early: after the first training epoch, the model has already begun to remember the inserted marker characters. But the test data show that the exposure from “unintentional memorization” typically peaks and begins to decline before the model starts to overfit, that is, before the test loss begins to rise.

Therefore, we can conclude that while “unintentional memorization” carries certain risks, it is not more dangerous than overfitting.


2. What are the specific risks of “unintentional memory”?

Of course, not being “more dangerous” does not mean unintentional memorization is not dangerous. In fact, the researchers found in their experiments that, with an improved search algorithm, 16-digit credit card numbers and 8-digit passwords could be extracted with only a few tens of thousands of queries. The specific attack details have been made public.
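The published attack is, at its core, a clever search rather than blind enumeration. As a rough, hypothetical sketch of the idea: a best-first (Dijkstra-style) search expands candidate suffixes in order of their running log-probability under the model, so the most likely sequence is found without visiting every possibility. The `next_char_logprobs` callback is an assumed interface to the model being attacked, not a real API.

```python
# Best-first search over digit sequences, ordered by running log-probability.
import heapq
from typing import Callable, Dict

def extract_most_likely(prefix: str,
                        length: int,
                        next_char_logprobs: Callable[[str], Dict[str, float]],
                        alphabet: str = "0123456789") -> str:
    # Each heap entry is (negative running log-prob, partial suffix); costs are
    # non-negative, so the first full-length suffix popped is the most likely one.
    heap = [(0.0, "")]
    while heap:
        neg_logp, suffix = heapq.heappop(heap)
        if len(suffix) == length:
            return suffix
        logps = next_char_logprobs(prefix + suffix)   # {char: log P(char | text)}
        for ch in alphabet:
            heapq.heappush(heap, (neg_logp - logps[ch], suffix + ch))
    return ""
```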

In other words, if sensitive information finds its way into the training data and the model is released to the world, the probability of that information being exposed is actually very high, even if the model shows no apparent overfitting. Moreover, this situation does not immediately raise alarm, which undoubtedly greatly increases the security risk.

3. What are the prerequisites for privacy data being exposed?

So far it appears that the marker characters the researchers inserted into the data set are more likely to be exposed than other random data, and the exposure follows a roughly normal distribution. This means that not all data in the model has the same probability of exposure; the deliberately inserted items are the more dangerous ones.

In addition, extracting sequences from a model’s “unintentional memorization” is not an easy task; it requires pure brute force, that is, essentially unlimited computing power. For example, enumerating all 9-digit social security numbers takes only a few GPU-hours, but enumerating all 16-digit credit card numbers would take thousands of GPU-years.
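A back-of-the-envelope calculation shows why the two cases differ so sharply; the query throughput below is purely an assumed figure, used only to show the ratio.

```python
# Scale of exhaustive search: 10**9 nine-digit numbers vs. 10**16 sixteen-digit
# numbers, under an assumed (illustrative) per-GPU query throughput.
space_ssn = 10 ** 9            # all 9-digit numbers
space_card = 10 ** 16          # all 16-digit numbers
queries_per_second = 100_000   # assumed throughput of one GPU

hours_ssn = space_ssn / queries_per_second / 3600
years_card = space_card / queries_per_second / 3600 / 24 / 365
print(f"9 digits:  ~{hours_ssn:.1f} GPU-hours")    # a few hours
print(f"16 digits: ~{years_card:.0f} GPU-years")   # thousands of years
```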

For now, as long as this kind of “unintentional memorization” can be quantified, the security of sensitive training data can be kept within a certain range. Knowing how much training data a model stores, and how much of it is over-memorized, helps in training a model closer to the optimum and helps people judge how sensitive the data is and how likely the model is to leak it.
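One way to quantify this, in the spirit of the “exposure” measure used in this line of research, is to compare how the inserted marker ranks against all candidates of the same format. A minimal sketch:

```python
# Exposure-style score: how much more likely the model finds the inserted
# marker than a random candidate of the same format. The rank would come from
# scoring candidates as in the earlier sketch; values here are illustrative.
import math

def exposure(rank_of_canary: int, total_candidates: int) -> float:
    # log2(candidate space) - log2(rank): higher means more strongly memorized
    return math.log2(total_candidates) - math.log2(rank_of_canary)

print(exposure(rank_of_canary=1, total_candidates=10 ** 9))        # ~29.9 bits (fully memorized)
print(exposure(rank_of_canary=10 ** 8, total_candidates=10 ** 9))  # ~3.3 bits (barely memorized)
```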

In the past, when we talked about the industrialization of AI, we mostly focused on the macro level: how to eliminate algorithmic bias, how to avoid the black box of complex neural networks, how to bring the technology down to earth and realize its dividends. Now, as the basic transformation and popularization of concepts are gradually completed, letting AI iterate and upgrade at a refined, microscopic level is perhaps the future the industry is looking forward to.