This article comes from the WeChat public account "Rokid" (ID: Rokid1115), written by Zheng Jiewen, and is published here with the authorization of ifanr.

Every forever-single straight guy dreams of running into Samantha, the robot girl from the movie "Her", on the way home from work. Although she is only heard and never seen, you can feel her convey every kind of emotion through her voice alone.


The voice behind Samantha belongs to Scarlett Johansson. As some have put it, "that voice alone satisfied all my fantasies about her."

It is fair to say that when it comes to closing the gap between humans and machines and drawing them closer together, voice is crucial.

In real life, however, AI voice assistants sound nothing like that ideal voice.

Why doesn't your robotic girlfriend talk like Scarlett Johansson? Today, Rokid A-Lab speech synthesis algorithm engineer Zheng Jiewen talks about speech synthesis technology and analyzes the reasons. Enjoy:


The technical principles behind TTS: front-end and back-end systems

The technology that lets an AI voice assistant speak is called TTS (text-to-speech), i.e. speech synthesis.

Building a TTS system that sounds natural, authentic, and pleasant is a direction that scientists and engineers in the AI field have long been working toward. However, various "blockers" keep appearing along the way. What exactly are they? Let's start with the basic principles of TTS.

TTS technology essentially solves the problem of converting text into speech, and in doing so lets the machine speak.

▲ Figure 1: Speech synthesis, the problem of going from text to speech

But this process is not easy. To make it easier for machines to handle, scientists split the conversion into two parts: a front-end system and a back-end system.

▲ Figure 2: A TTS system consisting of a front end and a back end

The front end converts the input text into an intermediate result and hands that result to the back end, which then generates the sound.
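To picture the split concretely, here is a minimal sketch of the two stages as a pair of functions. The names, types, and placeholder bodies are illustrative assumptions, not any real system's code.

```python
from typing import Dict, List

def front_end(text: str) -> List[Dict]:
    """Hypothetical front end: turn plain text into a linguistic
    specification (pronunciations plus prosody marks)."""
    # A real front end does text normalization, text-to-Pinyin conversion,
    # and prosody prediction; this placeholder just wraps each character.
    return [{"hanzi": ch, "pinyin": "?", "pause_after": 0.0} for ch in text]

def back_end(spec: List[Dict]) -> bytes:
    """Hypothetical back end: turn the specification into audio samples."""
    # A real back end uses waveform concatenation, or an acoustic model
    # plus a vocoder; this placeholder returns silence.
    return b""

def synthesize(text: str) -> bytes:
    return back_end(front_end(text))
```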

Next, let’s take a look at how the front-end and back-end systems work together.

The front-end system: generating the linguistic specification

When we were young, we had to learn Pinyin before we could recognize characters; with Pinyin, we could sound out words we did not know. For TTS, the intermediate result that the front-end system produces from text is a lot like Pinyin.

Pinyin alone is not enough, though, because what we want to read aloud is not a single word but a whole sentence. If a person cannot use cadence and rhythm to control the flow of their own speech, listeners feel uncomfortable and may even misunderstand what the speaker wants to convey. So the front end needs to add this kind of information to tell the back end how to "speak" correctly.

We call this cadence information prosody. Prosody is a very broad notion, so to simplify the problem it is broken down into pauses, stress, and other information. Pauses tell the back end where to break while reading the sentence; stress marks which parts to emphasize when reading aloud. All of this information combined is what we call the "linguistic specification".
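As a toy illustration of what one entry in such a specification might hold, here is a sketch with invented field names and values; the real specification, as noted below, is far richer.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SpecEntry:
    """One syllable in a toy linguistic specification (fields are illustrative)."""
    hanzi: str          # the written character
    pinyin: str         # its pronunciation, e.g. "hao3"
    stressed: bool      # whether this syllable should be emphasized
    pause_after: float  # length of the pause after it, in seconds

# "你真好看" ("You look so good") with stress on "真" and a pause at the end.
spec: List[SpecEntry] = [
    SpecEntry("你", "ni3", False, 0.0),
    SpecEntry("真", "zhen1", True, 0.0),
    SpecEntry("好", "hao3", False, 0.0),
    SpecEntry("看", "kan4", False, 0.3),
]
```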


▲ Figure 3: The front end tells the back end what content we want to synthesize by generating a "linguistic specification"

The front end acts like a linguist: it runs a variety of analyses on the plain text it is given, and then writes a specification for the back end telling it which sounds should be synthesized.

In a real system, for the machine to speak correctly, this "specification" is far more complicated than what we describe here.

The back-end system: playing the "speaker"

When the back-end system receives the "linguistic specification", its goal is to generate sound exactly as the specification describes.

Of course, the machine cannot generate sound out of thin air. Before that, we need to record anywhere from several to several dozen hours of audio in a studio (the amount of data differs depending on the technology used), and then use this data to build the back-end system.

Mainstream back-end systems currently use one of two methods: waveform concatenation or parametric generation.

The idea of waveform concatenation is very simple: store the pre-recorded audio on the computer, and when we want to synthesize speech, use the "specification" from the front end to look up the audio clips in that database that best fit it, then stitch those clips together one by one into the final synthesized speech.

For example, to synthesize the phrase "You look so good" (你真好看), we look up audio clips for the characters "你", "真", "好", and "看" in the database, and then stitch the four segments together.


▲ Figure 4: Synthesizing "You look so good" with the concatenation method

Of course, real concatenation is not that simple: we first have to choose the granularity of the concatenation units, and once the granularity is chosen we also need to design a concatenation cost function.
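To give a feel for what such a cost function might look like, here is a toy sketch assuming syllable-sized units and invented cost terms; real systems use much richer features and a proper search (for example Viterbi decoding) rather than the greedy pass shown here.

```python
from typing import Dict, List

def target_cost(unit: Dict, wanted: Dict) -> float:
    """How well a recorded unit matches what the specification asks for."""
    cost = 0.0
    if unit["pinyin"] != wanted["pinyin"]:
        cost += 10.0                           # wrong syllable is very expensive
    cost += abs(unit["duration"] - wanted["duration"])
    return cost

def join_cost(prev: Dict, unit: Dict) -> float:
    """How smoothly two units can be glued together (toy: pitch mismatch)."""
    return abs(prev["end_pitch"] - unit["start_pitch"])

def select_units(spec: List[Dict], database: List[Dict]) -> List[Dict]:
    """Greedy unit selection: for each target, pick the candidate that
    minimizes target cost plus join cost with the previous choice."""
    chosen: List[Dict] = []
    for wanted in spec:
        candidates = [u for u in database if u["pinyin"] == wanted["pinyin"]] or database
        best = min(
            candidates,
            key=lambda u: target_cost(u, wanted)
            + (join_cost(chosen[-1], u) if chosen else 0.0),
        )
        chosen.append(best)
    return chosen
```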

The parametric approach works on a very different principle from waveform concatenation. A parametric system takes a mathematical route: it first extracts the most salient acoustic features from the audio, and then uses a learning algorithm to learn a converter that maps the front end's linguistic specification onto those audio features.

Once we have this converter from the linguistic specification to audio features, synthesizing "You look so good" works like this: we first use the converter to produce the audio features, and then use another component to restore those features to sound we can hear. In the field, the converter is called the "acoustic model", and the component that turns acoustic features back into sound is called the "vocoder".
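Here is a deliberately crude sketch of that two-stage idea, in which the "features" are just a pitch value per syllable and the "vocoder" is a sine-wave generator. Real acoustic models predict much richer features (such as spectral envelopes) and real vocoders are far more sophisticated; every number below is invented.

```python
import numpy as np

SAMPLE_RATE = 16000

def acoustic_model(spec):
    """Toy 'acoustic model': map each syllable to (F0 in Hz, duration in s)."""
    # A real model learns this mapping from recordings; here we invent values.
    return [(220.0 if s["stressed"] else 180.0, 0.25) for s in spec]

def vocoder(features):
    """Toy 'vocoder': turn (F0, duration) pairs back into a waveform."""
    chunks = []
    for f0, dur in features:
        t = np.arange(int(dur * SAMPLE_RATE)) / SAMPLE_RATE
        chunks.append(0.3 * np.sin(2 * np.pi * f0 * t))
    return np.concatenate(chunks)

spec = [{"pinyin": "ni3", "stressed": False}, {"pinyin": "zhen1", "stressed": True},
        {"pinyin": "hao3", "stressed": False}, {"pinyin": "kan4", "stressed": False}]
waveform = vocoder(acoustic_model(spec))  # a very robotic rendering of "你真好看"
```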

Why doesn't your AI voice assistant talk like a person?

To give this question a short answer, there are two main reasons:

  • Your AI makes mistakes. To synthesize sound, the AI must make a series of decisions, and whenever these decisions go wrong, the resulting synthesized speech has problems: a strong mechanical feel and a lack of naturalness. Both the front-end and the back-end systems of TTS can make such mistakes.
  • When building AI that synthesizes sound, engineers oversimplify the problem, so the modeled process no longer matches how speech is actually produced. This simplification comes partly from the limits of our own understanding of language and of how humans produce speech, and partly from the cost constraints that commercial speech synthesis systems must respect when they run.

Let's first look at the front-end and back-end errors that make an AI assistant sound unnatural.

Front-end errors

The front-end system, our "linguist", is the most complex part of the entire TTS system. To produce the final "linguistic specification" from plain text, this linguist does far more than we might imagine.


▲ Figure 5: A typical front-end processing flow

A typical front-end processing flow is:

  • Text structure analysis

When text enters the system, the system must first determine what language it is in; only then does it know how to process it. Next, the text is split into individual sentences, which are passed on to the subsequent modules.

  • Text normalization

In a Chinese setting, the purpose of text normalization is to convert punctuation and digits that are not Chinese characters into Chinese characters. For example, in "This operation is 666", the system needs to convert "666" into "six six six", reading the digits out one by one.

  • Text-to-Pinyin conversion

This step converts the text into Pinyin. Because Chinese has polyphonic characters, we cannot simply look up a character's pronunciation the way we would in the Xinhua Dictionary; we have to use auxiliary information and algorithms to decide on the correct reading. That auxiliary information includes word segmentation and the part of speech of each word.

  • Prosody prediction

This step determines the rhythm with which the sentence is read, i.e. its cadence. A typical simplified system, however, only predicts pause information: whether to pause after a given word, and how long the pause should be.

As you can see, any of these four steps can go wrong, and once an error occurs the generated linguistic specification is wrong, and the sound synthesized by the back end is wrong with it.

Typical front-end errors in a TTS system fall into the following types:

1. Text normalization errors

Because the written form of text and the way we read it aloud differ, at the very start of the front end we need to convert the written form into the form we actually say; in the field this process is called "text normalization". The "666" we mentioned earlier, which has to become "six six six", is one example. Text normalization errors in a TTS system are easy to notice. Take the following sentence:


"I spent 666 yuan on a room whose room number is 666."

We know that the first "666" should be read as an amount, "six hundred and sixty-six", and the second as a room number, "six six six". But a TTS system gets this wrong very easily.

Another example: "I think there is a 2-4% chance. The score is 2-4."

Should each "2-4" be read as a range ("two to four") or as a score ("two to four", i.e. 2:4)? You can tell at a glance which reading fits where; for a front-end system, however, this is yet another problem.
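As a toy illustration of why such decisions are hard, the sketch below reads a digit string one way when it looks like a quantity and another way when it looks like an identifier. The cue word it checks for is a made-up heuristic; real normalizers need far stronger context models.

```python
import re

DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
TENS = {"2": "twenty", "3": "thirty", "4": "forty", "5": "fifty",
        "6": "sixty", "7": "seventy", "8": "eighty", "9": "ninety"}

def digit_by_digit(num: str) -> str:
    """Read '666' as 'six six six' (room numbers, codes, phone numbers)."""
    return " ".join(DIGITS[d] for d in num)

def as_quantity(num: str) -> str:
    """Read '666' as 'six hundred and sixty six' (toy: three-digit numbers
    only; values below 20 in the tail are just read digit by digit)."""
    n = int(num)
    hundreds, rest = divmod(n, 100)
    if rest >= 20:
        tail = TENS[str(rest // 10)]
        if rest % 10:
            tail += " " + DIGITS[str(rest % 10)]
    else:
        tail = digit_by_digit(str(rest)) if rest else ""
    return DIGITS[str(hundreds)] + " hundred" + (" and " + tail if tail else "")

def normalize(sentence: str) -> str:
    """Toy heuristic: a number right after the word 'number' is an identifier
    and is read digit by digit; anything else is read as a quantity."""
    def repl(m):
        before = sentence[:m.start()].rstrip().lower()
        return digit_by_digit(m.group()) if before.endswith("number") else as_quantity(m.group())
    return re.sub(r"\d{3}", repl, sentence)

print(normalize("I spent 666 yuan on a room whose room number is 666."))
# -> "I spent six hundred and sixty six yuan on a room whose room number is six six six."
```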

2. Pronunciation errors

Chinese is a vast and profound language, and reading it aloud correctly is not easy. One of the harder problems: when you run into a polyphonic character, which reading should you pick?

Take, for example, two sentences that in English both come out as "My hair is long" versus "My hair has grown": in Chinese they hinge on the same character, "长". Should it be read as the second-tone "cháng" (long) or the third-tone "zhǎng" (to grow)?

Of course, a person can easily pick the right reading. Now try the following sentence:

"If a person is capable, then whatever line of work they take up will work out, and line after line will work out; if they are not capable, then whatever line they take up will not work, and no line will work at all." (In the original Chinese, every one of these "capable / work / line" words is the same character, 行, which can be read either "xíng" or "háng".)

You probably have to think for a moment before you can read every "行" in that sentence correctly. For an AI it is even harder.

From time to time you may hear an AI assistant pick the wrong reading for a polyphonic character. This kind of error is caught by the ear immediately and leaves a clear impression: "this is definitely not a real person talking."

Of course, polyphone errors are only one kind of pronunciation error; there are others involving neutral tones, erhua (the rhotic "-r" ending), tone sandhi, and so on. In short, it is not easy for your AI assistant to read everything accurately.
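A toy sketch of the dictionary-plus-rules idea behind polyphone handling follows; the word list, readings, and pre-segmented input are illustrative assumptions, and a real front end relies on segmentation, part-of-speech tagging, and statistical models.

```python
# Default (most common) reading for each character, plus a few word-level
# overrides for known polyphones. All entries are illustrative.
DEFAULT_READING = {"长": "chang2", "行": "xing2", "发": "fa1"}
WORD_OVERRIDES = {
    "长大": ["zhang3", "da4"],   # "to grow up": 长 reads zhang3 here
    "银行": ["yin2", "hang2"],   # "bank": 行 reads hang2 here
    "头发": ["tou2", "fa4"],     # "hair": 发 reads fa4 here
}

def to_pinyin(words):
    """Convert a pre-segmented sentence (list of words) to pinyin,
    preferring word-level overrides over per-character defaults."""
    result = []
    for word in words:
        if word in WORD_OVERRIDES:
            result.extend(WORD_OVERRIDES[word])
        else:
            result.extend(DEFAULT_READING.get(ch, "?") for ch in word)
    return result

print(to_pinyin(["头发", "长"]))   # ['tou2', 'fa4', 'chang2']
print(to_pinyin(["银行"]))         # ['yin2', 'hang2']
```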

3. Prosody errors

As mentioned above, to convey information accurately, people speak with a sense of rhythm. If someone never pauses while talking, it becomes hard for us to understand them, and we may even feel they are being rude. Scientists and engineers keep trying to make TTS read with better rhythm and more "politely", but in many cases its performance is still unsatisfactory.

This is because language varies so richly: depending on the context and even the occasion, the rhythm with which we read the same text changes. Within prosody, pausing matters most, because pauses are the basis for reading a sentence correctly; if a pause lands in the wrong place, the ear catches the error easily.

For example, take "Switching to single-track repeat mode for you". If we use "|" to mark a pause, a person would normally read it with this rhythm: "Switching for you | to single-track repeat mode".

But if your AI assistant pauses in the middle of the word "switching", as in "Swit | ching to single-track repeat mode for you", your heart may break a little.
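A toy illustration of why pause placement has to respect word boundaries: the naive strategy below pauses after a fixed number of characters and happily splits a word in half, while the word-aware one only pauses at segmentation boundaries. The segmentation shown is an assumption for the example.

```python
def naive_pauses(text: str, every: int = 3) -> str:
    """Insert a pause marker '|' after every N characters, ignoring words."""
    return "|".join(text[i:i + every] for i in range(0, len(text), every))

def word_aware_pauses(words, max_run: int = 4) -> str:
    """Only allow a pause at word boundaries, once enough characters
    have accumulated since the last pause."""
    out, run = [], 0
    for w in words:
        out.append(w)
        run += len(w)
        if run >= max_run:
            out.append("|")
            run = 0
    return "".join(out).rstrip("|")

sentence = "为你切换单曲循环模式"  # "switching to single-track repeat mode for you"
words = ["为", "你", "切换", "单曲", "循环", "模式"]  # assumed segmentation

print(naive_pauses(sentence))    # 为你切|换单曲|循环模|式  (splits "切换" -- sounds wrong)
print(word_aware_pauses(words))  # 为你切换|单曲循环|模式   (pauses only between words)
```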

Back-end errors

Having covered the "linguist" who often makes mistakes, let's look at the back end: the "speaker" who reads aloud from the specification handed over by the "linguist".

As mentioned earlier, the back end mainly uses one of two methods: concatenation or the parametric method. Apple's Siri and Amazon's Alexa currently use waveform concatenation, while in China most companies use the parametric method. Rokid's Ruoqi also uses the parametric method, so we will look at the errors a parametric back end can make.

When the back-end system receives the linguistic information from the front end, the first thing it must decide is how long each Chinese character should be pronounced (and even how long each initial and each final should be). The component that determines pronunciation length is called the "duration model".
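As a toy sketch of the kind of decision a duration model makes, the snippet below starts from invented base durations and stretches them with simple contextual factors; a real model learns these relationships from recorded data.

```python
# Base phone durations in seconds (all values invented for illustration).
BASE_DURATION = {"n": 0.06, "i": 0.12, "h": 0.05, "ao": 0.14, "en": 0.16}

def predict_duration(phone: str, is_sentence_final: bool, is_modal_particle: bool) -> float:
    """Toy duration prediction: start from a base value and stretch it
    according to simple contextual factors."""
    dur = BASE_DURATION.get(phone, 0.10)
    if is_sentence_final:
        dur *= 1.3   # final syllables tend to be lengthened
    if is_modal_particle:
        dur *= 1.8   # a thoughtful "Hmm..." is much longer than a quick "Hmm?"
    return dur
```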

With this timing information, the back-end system can convert the linguistic specification into audio features using the converter we mentioned earlier (the acoustic model). Those audio features are then restored to sound by the other component, the "vocoder". From the duration model to the acoustic model to the vocoder, every step can make mistakes or fail to produce exactly the result we want.

Typical back-end errors in a TTS system fall into the following types:

1. Duration model errors

When a sentence is read aloud, the duration of each word depends on context. The TTS system must decide, based on that context, which words should be read longer and which shorter. A typical example is the reading of modal particles.

These particles usually carry the speaker's tone and emotion, and they are pronounced longer than ordinary words, for example: "Hmm... I think he is right."

The "Hmm" here clearly needs to be drawn out to convey a judgment reached after some thought.

But not every "hmm" should be that long, for example: "Hmm? What did you say?"

The "Hmm" here expresses a questioning tone, and it is pronounced much more briefly than the "Hmm" in the previous sentence. If the duration model cannot decide these durations correctly, the result feels unnatural to listeners.

Of course, Rokid has its own patented approach to modal-particle pronunciation that produces very natural results; we will introduce it in a dedicated follow-up article.

2. Acoustic model errors

The most important acoustic model error occurs when synthesis runs into pronunciations that never appeared in the corpus used to train the back end. The acoustic model's job is to learn, from the training sound bank, the acoustic features corresponding to all kinds of "linguistic specifications"; if synthesis encounters a linguistic specification that was never seen during training, it is hard for the machine to output the correct acoustic features.

A common example is erhua (the rhotic "-r" ending). In principle every Chinese syllable has a corresponding erhua form, but in real speech some of these forms occur very rarely, so a recorded sound bank usually does not cover them all and keeps only the most common ones. When an uncovered form is needed, the sound either cannot be produced or comes out badly.

3. Vocoder errors

There are many kinds of vocoders, but the more common ones usually rely on fundamental frequency (F0) information. So what is the fundamental frequency? It is how fast your vocal cords vibrate while you talk. Here is a simple way to feel your own: press the four fingers other than your thumb against your throat and start talking to yourself.

You will feel your throat vibrating; that vibration is the fundamental frequency information. Sounds produced with vocal-cord vibration are called voiced; sounds produced without it are called unvoiced. Consonants can be either voiced or unvoiced, while vowels are generally voiced. Accordingly, the positions of vowels and voiced consonants in synthesized speech should line up with the fundamental frequency; if the fundamental frequency output by the acoustic model deviates, the sound the vocoder synthesizes will sound strange.

When training the back-end "speaker", we also rely on algorithms to extract the fundamental frequency from the recordings. A poor extraction algorithm may lose the fundamental frequency entirely, or double it, or halve it, and these errors directly degrade the fundamental frequency prediction model. If no fundamental frequency is predicted where there should be one, the synthesized voice sounds hoarse, and the effect on the listener is very noticeable.
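To make the doubling/halving problem concrete, here is a minimal autocorrelation-based pitch estimator sketch (not a production pitch tracker): it picks the lag at which the frame best matches a shifted copy of itself, and when a harmonic or sub-harmonic lag scores almost as well, the estimate can jump by an octave.

```python
import numpy as np

def estimate_f0(frame: np.ndarray, sample_rate: int = 16000,
                f0_min: float = 60.0, f0_max: float = 400.0) -> float:
    """Minimal autocorrelation pitch estimator for one voiced frame.
    A signal with period T is similar to itself shifted by T, so the
    autocorrelation peaks at lag = sample_rate / F0. Because it also
    peaks at 2T (half the F0) and can peak at T/2 (double the F0),
    a noisy frame easily produces octave errors."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    return sample_rate / best_lag

# Example: a clean 200 Hz tone is estimated correctly.
sr = 16000
t = np.arange(int(0.03 * sr)) / sr
print(round(estimate_f0(np.sin(2 * np.pi * 200 * t), sr)))  # prints approximately 200
```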

A good vocoder also has to handle the relationship between the fundamental and its harmonics properly: if the high-frequency harmonics are too prominent, the result has an audible squeak and an obvious mechanical quality.

Summary

In this article we introduced the basic principles of TTS and analyzed why voice assistants cannot talk like real people: TTS makes mistakes in its many decisions, leading to reading errors or unnatural speech; and to make synthesis possible at all, engineers simplify the text-to-speech problem, so the modeled process no longer fully matches how speech is actually produced. This simplification stems from the limits of our understanding of how speech and language are generated, and it is also constrained by today's computing tools.

Although many new methods have emerged in the AI field, in particular deep learning approaches that convert text to speech directly and already produce very natural-sounding voices, making a voice assistant speak entirely like a human is still a very challenging job. The Rokid A-Lab team is committed to exploring breakthroughs in TTS technology and their applications, and looks forward to bringing users a more natural voice.

Author: Zheng Jiewen holds a master's degree in artificial intelligence from the University of Edinburgh, where he studied under the internationally renowned speech synthesis expert Simon King. He is currently a speech synthesis algorithm engineer at Rokid A-Lab, responsible for speech synthesis engine architecture design, back-end acoustic model development, and more.