New hardware forms require new ways of interacting.

Editor’s note: This article is from the WeChat public account “Semiconductor Industry Watch” (ID: icbank), by Li Fei.

Last week, Google unveiled several new hardware products at its Made by Google event, including the latest Pixel 4 phones, the Pixel Buds 2 smart earbuds, and the Nest Mini smart speaker. Beyond differing in form from traditional hardware, these devices also innovate in how users interact with them, and those new interaction features bring the chips behind them into view.


New hardware forms require new ways of interacting

Google’s launch last week continued a recent wave of hardware releases from Internet companies; shortly before Google, Amazon and Microsoft also released new hardware.

If we carefully analyze the hardware these Internet companies have released, the first thing we see is their determination to break out of the traditional smart-hardware landscape. Their new hardware is more intelligent and often pairs non-traditional form factors (such as Amazon’s smart glasses) with artificial intelligence to refreshing effect. The ultimate goal of an Internet company doing hardware is for the hardware to serve as an entry point through which users consume its Internet services, so even if no single device ships in large volume, the effort succeeds as long as it brings in a certain amount of traffic. That is why Amazon and Google have adopted a similar shotgun approach: releasing many different devices at once rather than concentrating all resources on one or two flagship products.

As mentioned earlier, the new hardware developed by these Internet companies takes new forms, and its ultimate goal is to interact with users and connect them to the companies’ services. How to match a new user interface to these new hardware forms therefore becomes very important. The mainstream touch-screen interaction scheme cannot meet the requirements of these new forms of smart hardware, so exploring the next-generation user interface and the chips behind it is imperative.

On-device voice interaction

Among new user interfaces, the most widely accepted today is voice interaction. Voice interaction officially entered mass consumer applications with Apple’s launch of Siri, after which Amazon’s Echo series of smart speakers truly ignited the consumer voice-interaction market. Google is unwilling to lag behind: after launching its Google Home series of smart speakers, the Pixel 4 phone, Pixel Buds 2 earbuds, and Nest Mini released at this event all carry its latest voice-interaction interface, backed by dedicated machine-learning silicon.

So, what distinguishes the voice interaction in Google’s newly released hardware from previous voice interfaces? We believe the biggest difference is the emphasis on on-device computing: performing as much of the voice interaction as possible on the device itself, without transmitting audio to the cloud. Functionally, a voice interface computed on-device can handle basic interactions even when no network connection is available, greatly expanding the practical scenarios.

In terms of performance, network transmission introduces significant energy consumption and latency, so doing most of the voice interaction locally can greatly extend a smart device’s battery life and respond to user requests in less time, improving the user experience. Finally, from a compliance perspective, regulation of Internet users’ data keeps tightening, so keeping voice-interaction processing local rather than uploading it to the cloud avoids any suspicion of infringing user privacy.

In terms of computational complexity, local voice-interaction interfaces fall into two categories. The first is low-complexity computation (such as keyword recognition): the computation itself is simple, but energy consumption and latency must both be as low as possible. The second is high-complexity computation (for example, transcribing speech to text in real time, or assistant-class tasks that require some semantic understanding of the user’s speech).

At this event, the voice interface of the Pixel Buds 2 belongs to the former category. The main feature of the Pixel Buds 2 voice interaction is completing functions such as sending text messages, reading text messages, and playing music according to the user’s spoken commands. According to Google, the Pixel Buds 2 include a dedicated machine-learning chip to run this type of voice interface.

After carefully analyzing the implementation of the Pixel Buds 2’s smart assistant, we believe its voice assistant mainly recognizes the user’s voice commands on the earbuds and relies on the Bluetooth-connected phone to carry those commands out.

For example, if the user says “read text message,” the voice assistant in the earbuds must first recognize that the user is issuing a voice command at all, then recognize the approximate content of that command (“read text message”) and send it to the phone; the phone’s TTS algorithm then converts the text message into audio, which is transmitted back to the earbuds over Bluetooth and played. In such a flow, the keyword-recognition algorithm in the earbuds’ voice interface must perform more complex functions than traditional single-keyword wake-up.
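To make this division of labor concrete, here is a minimal Python sketch of the flow just described. All class and function names are hypothetical stand-ins, not Google’s actual APIs, and the recognizer is stubbed out.

```python
# Hypothetical sketch of the earbud/phone split described above.
from dataclasses import dataclass
from typing import Optional

COMMANDS = ("read_messages", "send_message", "play_music")

@dataclass
class StubRecognizer:
    """Stand-in for the earbuds' on-chip command recognizer."""
    def command_started(self, frame: bytes) -> bool:
        # Real version: an always-on, ultra-low-power detector.
        return len(frame) > 0

    def classify(self, frame: bytes) -> str:
        # Real version: a small neural network over audio features.
        return "read_messages"

def on_microphone_frame(frame: bytes, recognizer: StubRecognizer) -> Optional[str]:
    """Runs on the earbuds' low-power chip for each audio frame."""
    # Step 1: detect that a voice command is being issued at all.
    if not recognizer.command_started(frame):
        return None
    # Step 2: coarsely classify the utterance into a small command set
    # (more than single-keyword wake-up, but still lightweight).
    command = recognizer.classify(frame)
    # Step 3: forward the recognized command to the phone over Bluetooth;
    # the phone runs TTS on the message text and streams audio back.
    return command if command in COMMANDS else None

print(on_microphone_frame(b"\x01\x02", StubRecognizer()))  # read_messages
```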


In addition to the Pixel Buds 2, Google also uses an offline voice model on the Nest Mini and the Pixel 4 to support the voice interface. According to Google’s official statements, the dedicated machine-learning acceleration chip in the Nest Mini lets Google Assistant respond faster, while the offline voice model on the Pixel 4 can handle more complex voice interactions, such as offline transcription of speech to text and complex multi-turn voice commands (for example, asking the Assistant to find a picture and send it to a contact).

Technically, the first category, low-complexity keyword-recognition algorithms, is currently implemented mostly with convolutional neural networks. The convolutional networks used in voice interfaces demand less compute than those used in computer vision, but given the hardware limitations of these application scenarios (for example, a chip inside an earbud is unlikely to be paired with external DRAM), achieving high-accuracy keyword recognition at the lowest possible hardware cost and power consumption remains challenging.
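As a rough illustration of how small such a network can be, here is a minimal keyword-spotting CNN sketch in PyTorch. The 40-band mel input, 10-keyword output, and layer sizes are our own assumptions for illustration, not the actual Pixel Buds model.

```python
# Minimal keyword-spotting CNN sketch (PyTorch); sizes are assumptions.
import torch
import torch.nn as nn

class KeywordSpotter(nn.Module):
    def __init__(self, n_mels: int = 40, n_keywords: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            # Treat the (time x mel) spectrogram as a 1-channel image.
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling keeps the net tiny
        )
        self.classifier = nn.Linear(32, n_keywords)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, time_frames, n_mels)
        x = self.features(spectrogram).flatten(1)
        return self.classifier(x)

model = KeywordSpotter()
logits = model(torch.randn(1, 1, 100, 40))  # ~1 s of audio at a 10 ms hop
print(logits.shape)                          # torch.Size([1, 10])
print(sum(p.numel() for p in model.parameters()))  # ~5k parameters
```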

For example, to minimize power consumption, the relevant blocks in the chip may need to operate at very low supply voltages, possibly lower than the minimum voltage supported by the fab’s standard flow, which challenges the low-power design flow. In addition, because DRAM cannot be included in such applications, the neural-network model is tightly constrained, and striking a good balance between model size and accuracy requires a lot of work.
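A back-of-the-envelope calculation shows why the no-DRAM constraint bites. The 512 KB SRAM budget and the model sizes below are illustrative assumptions, not figures from Google.

```python
# Back-of-the-envelope: why no-DRAM designs constrain model size.
sram_budget_bytes = 512 * 1024  # assumed on-chip SRAM budget

def model_bytes(n_params: int, bits_per_weight: int) -> int:
    return n_params * bits_per_weight // 8

for n_params in (100_000, 1_000_000, 5_000_000):
    for bits in (32, 8):
        size = model_bytes(n_params, bits)
        fits = "fits" if size <= sram_budget_bytes else "does NOT fit"
        print(f"{n_params:>9,} params @ int{bits}: {size / 1024:7.0f} KB -> {fits}")

# A 1M-parameter float32 model (~4 MB) overflows this SRAM, and even
# quantized to 8 bits (~1 MB) it still does not fit; only models in the
# low hundreds of thousands of parameters do.
```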

Overall, this type of design requires hardware/software co-design to achieve reasonable accuracy at the lowest possible power. Based on currently available information, we estimate that the machine-learning chip in the Pixel Buds 2 is likely a DSP or Google’s own IP integrated into a conventional TWS wireless-earbud chip to perform these low-power computations. Given the constraints on cost and hardware size, we believe future ultra-low-power voice-keyword interaction solutions in this direction will most likely exist in the form of IP, either integrated into the earbud’s main control chip or integrated together with the front-end microphone.

The voice interaction on the Pixel 4 is a typical high-complexity speech model (the second category of computation), which usually calls for a recurrent neural network rather than a convolutional one. Although recurrent-network computation is mainly matrix computation, optimizing the model and the on-chip memory to minimize memory-access overhead remains the most critical design point.

Unlike convolutional networks, recurrent networks exhibit little data reuse, so developing and optimizing the corresponding models and hardware requires methods different from those used for convolutional networks; this is the main challenge for recurrent-network computing hardware today. Compared with the mature hardware support for convolutional networks, hardware support for recurrent networks is still at an early stage in the industry, but we believe that as complex offline voice-interaction applications become popular, more and more design and chip solutions will emerge.
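The reuse problem can be seen in a few lines of NumPy: each timestep of a recurrent layer is a matrix-vector product, so the full weight matrices must be streamed from memory at every step, whereas a convolution kernel is reused across many positions. The sizes below are arbitrary.

```python
# Why recurrent nets reuse data poorly: every weight is fetched once per
# timestep and used roughly once, with no sliding-window reuse.
import numpy as np

hidden, inp, steps = 256, 128, 50        # illustrative sizes
W_h = np.random.randn(hidden, hidden)    # recurrent weights
W_x = np.random.randn(hidden, inp)       # input weights
h = np.zeros(hidden)

weight_bytes = (W_h.size + W_x.size) * W_h.itemsize
for t in range(steps):
    x_t = np.random.randn(inp)           # stand-in for one feature frame
    h = np.tanh(W_h @ h + W_x @ x_t)     # one matrix-vector step

# Every step streams the full weight matrices from memory:
print(f"bytes of weights fetched per step: {weight_bytes:,}")
print(f"total over {steps} steps: {steps * weight_bytes:,}")
```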

We foresee several possible solutions. First, for applications where power and performance requirements are not critical, the recurrent-network model can be made very small and run on DSP- or NEON-class IP blocks that support matrix acceleration. The advantage of this approach is fast deployment: only the software needs to be designed, with little or no hardware change; the drawback is that memory access cannot be optimized for the recurrent network. Where performance and power requirements are higher, more specialized hardware architectures can be expected to do the acceleration. For example, the recurrent networks currently used in speech applications tend to be sparse, so designing dedicated accelerators that support sparse matrix storage and operations promises higher performance and lower power consumption.
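To illustrate the sparsity argument, the following sketch compares the memory traffic of a dense weight matrix with a CSR (compressed sparse row) representation, the kind of format a sparse accelerator might exploit. The 90% sparsity level is an assumption for illustration.

```python
# Sparsity cuts memory traffic: CSR stores and streams only the nonzeros.
import numpy as np
from scipy.sparse import random as sparse_random

n = 1024
W_sparse = sparse_random(n, n, density=0.10, format="csr")  # 90% zeros
x = np.random.randn(n)

y = W_sparse @ x  # only nonzero entries participate in the product

dense_bytes = n * n * 8  # float64 dense matrix
csr_bytes = (W_sparse.data.nbytes + W_sparse.indices.nbytes
             + W_sparse.indptr.nbytes)
print(f"dense weight traffic per step: {dense_bytes / 1e6:.1f} MB")
print(f"CSR weight traffic per step:   {csr_bytes / 1e6:.1f} MB")
```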

In summary, as voice interfaces become popular, we expect to see DSP-like IP appear in more voice-related hardware, and as complex offline voice interaction spreads, we expect to see dedicated voice-acceleration chips and IP.

In-air gesture interaction

In addition to voice interaction, another highlight of Google’s Pixel 4 phone is the use of millimeter-wave radar for gesture interaction.

The millimeter-wave radar chip in the Pixel 4 is the commercialization of Project Soli from Google’s Advanced Technology and Projects (ATAP) group. It operates in the 60 GHz band and uses radar to detect changes in the distance between the target and the phone, enabling in-air gesture operation.


Specifically, the radar sensor chip works by emitting electromagnetic waves; the waves reflect off the user’s hand back to the sensor, and from the echo the chip can detect the position and motion of the hand, thereby enabling contactless 3D gesture detection.


The Pixel 4’s radar chip uses the 57-64 GHz band, which in theory achieves millimeter-level resolution. Judging from the millimeter-wave radar sensor chip shown by Project Soli (the prototype of the radar chip used in the Pixel 4), the chip measures roughly 8 mm x 10 mm and carries an on-chip antenna array; according to official information, the chip integrates four transmitters and two receivers and uses beamforming to improve resolution.
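Some quick arithmetic (our own, not from Google’s materials) shows how the stated band relates to resolution: the 7 GHz of sweep bandwidth gives a range-bin size of about 2 cm, while the “millimeter-level” figure is better read as motion sensitivity from phase measurement, since the wavelength at 60 GHz is only about 5 mm.

```python
# Worked numbers from the 57-64 GHz band stated above.
c = 3e8                    # speed of light, m/s
bandwidth = 64e9 - 57e9    # 7 GHz of sweep bandwidth

range_resolution = c / (2 * bandwidth)
print(f"range resolution: {range_resolution * 100:.1f} cm")  # ~2.1 cm

# Millimeter-level sensitivity comes from phase: at 60 GHz the wavelength
# is ~5 mm, so sub-millimeter hand motion produces a large, easily
# measurable phase shift in the echo.
wavelength = c / 60e9
print(f"wavelength at 60 GHz: {wavelength * 1000:.1f} mm")   # ~5.0 mm
```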


Millimeter-wave radar also has limitations, chiefly the hardware’s demands on size and power: high-precision, high-resolution detection requires complex antennas and/or multiple radar transceiver arrays. In the 60 GHz band, complex antenna arrays are bulky, and using multiple radar transceiver arrays greatly increases system power consumption. Media reviews report that the resolution of the Pixel 4’s millimeter-wave radar is not high, reportedly because the space left in the Pixel 4’s hardware design was too small to accommodate a radar transceiver and antenna array capable of high resolution. Of course, this problem is expected to be solved later through optimized hardware design.

In fact, operating smart devices with gestures has long been a direction of industry development. The traditional approach uses a camera combined with machine vision, but a 2D camera struggles to detect changes along the depth axis of a gesture, which limits the interaction. Microsoft’s Kinect for Xbox used a ToF 3D camera to support gestures, but the environments in which 3D cameras work are limited: structured-light solutions are too slow, and ToF-based solutions perform poorly in bright environments.

Beyond visual solutions, ultrasound is also a viable option. The ultrasonic scheme resembles the millimeter-wave scheme, except that it uses ultrasound rather than electromagnetic waves. Its advantage is low power consumption (it can run under 1 mW, versus 10-100 mW for millimeter-wave schemes); its disadvantage is that it requires ultrasonic components that cannot be realized in a CMOS process, whereas a millimeter-wave scheme can be implemented entirely with CMOS circuits and thus integrated to a higher degree. In the field of in-air interaction for smart devices, then, millimeter-wave radar and ultrasound each have their merits on specific technical metrics.

Taking a longer view, we think solutions based on electromagnetic waves, including millimeter waves, offer greater scalability. We believe adding a millimeter-wave radar to a phone is only the first step for this type of interaction; interaction based on electromagnetic waves (including millimeter waves) will appear in more smart appliances over the next few years. Beyond gesture interaction, electromagnetic waves can detect people and recognize objects in a room and are expected to interface seamlessly with WiFi devices, eliminating both the hassle of installing a camera and the privacy concerns it raises. RF chips for interactive applications are therefore expected to become a new category in the coming years.

Looking ahead, millimeter-wave radar for human-computer interaction mainly needs to overcome the bottlenecks of module size and power consumption. To that end, the radar itself must be optimized to improve the signal-to-noise ratio so that the required resolution can still be achieved with smaller antennas and fewer transceiver array elements, or the antenna design must be optimized to keep losses very low at small sizes. We believe that as these technical bottlenecks are gradually broken, more millimeter-wave-based interaction will appear in smart devices.
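As a rough sense of that trade-off (a textbook relationship, not a figure from this article): an N-element receive array that coherently combines echoes gains about 10·log10(N) dB of SNR, so every halving of the array costs roughly 3 dB that must be recovered elsewhere.

```python
# SNR vs. antenna count for a coherently combining receive array.
import math

for n_elements in (1, 2, 4, 8):
    gain_db = 10 * math.log10(n_elements)
    print(f"{n_elements} RX elements: +{gain_db:.1f} dB SNR")

# Halving the array costs ~3 dB, which the design must recover through
# higher transmit power (more energy) or longer integration (more
# latency) if resolution targets stay fixed.
```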