This article comes from the WeChat public account “Rokid”, written by Zhang Zhaohui, and is published by Ai Faner with authorization.

Before language appeared, humans communicated with limbs and gestures, an almost instinctive way of interacting. After machines were invented, gestures still kept many application scenarios, thanks to natural advantages that interaction modes such as the keyboard, mouse, and touch screen cannot replace. In the movie “Iron Man”, the protagonist pushes, drags, and manipulates virtual objects with a wave of his hand, which looks undeniably cool.

To achieve the high-precision, stable gesture recognition seen in movies, you need both hardware and algorithms; neither is dispensable. What are the common hardware solutions for gesture recognition? How do engineers use AI algorithms to improve recognition? What are the common application scenarios for gesture recognition? Let Rokid R-Lab algorithm engineer Zhang Zhaohui walk us through it.

Three hardware solutions for gesture recognition

The principle of gesture recognition is not complicated: hardware captures natural signals, just as a camera captures image information, and software algorithms then compute the position, posture, gesture, and other attributes of the hand and turn them into information the computer can understand. At present, there are three main hardware solutions for gesture recognition:

1. Camera solution

Camera solutions are commonly divided into color camera solutions and depth camera solutions.

  • Color camera solution

The color camera solution needs only an ordinary camera to capture a color image, and then obtains the position, posture, gesture, and other information of the hand in the image through AI algorithms. The advantages are low equipment cost and easy access to data.

Gesture recognition based on a monocular RGB camera is currently studied in both academia and industry. Commercial solutions include Ingeji, ManoMotion, and ArcSoft.

Some artificial intelligence open platforms also offer this approach. For example, the Tencent AI open platform provides static gesture recognition and hand keypoints, while the Baidu AI Open Platform and Face++ provide static gesture detection. There are also open source projects such as openpose and Google Mediapipe.


▲ The picture shows openpose hand keypoint detection

Compared to the depth camera solution, the color camera solution lacks depth information, is greatly affected by lighting, cannot be used at night, and its stability and accuracy are not as good as those of the depth camera solution.

  • Depth camera solution

This solution uses a depth camera to obtain images with depth information. The advantage is that it is easier to get 3D information about the hand, so the 3D keypoints obtained by the AI algorithm are more accurate and stable. The disadvantage is that it requires extra equipment, and the hardware cost is relatively high.

Depth cameras fall into three main categories: ToF, structured light, and binocular imaging.

Among them, the depth maps obtained by ToF and structured light are relatively accurate, but the cost is relatively high. They are mostly used in scientific research on gestures, and commercial products are few; examples include Microsoft HoloLens and Extreme Fish Technology ThisVR.

Binocular imaging is very suitable for gesture recognition because of its large field of view and high frame rate. The only drawback is that, limited by its imaging principle, the whole binocular camera module is much larger than ToF and structured-light modules.

Companies using binocular imaging are represented by Leap Motion, the largest gesture recognition company. It uses an active binocular imaging solution: in addition to the binocular camera there are three fill-light units, and it captures the 26 DoF of both hands, static gestures, dynamic gestures, and so on. Leap Motion also provides a very complete SDK, with good support for all platforms except mobile.


▲ The photo shows Leap Motion’s demo

There are also companies in China doing binocular gestures. For example, uSens Fingo is based on an active binocular vision solution and provides 26 DoF of the hands, static gestures, and dynamic gesture recognition. Compared to Leap Motion, uSens focuses more on supporting mobile phones and other low-power embedded devices. The company Vidoo, with its Primary, also has a binocular gesture solution.

2. Millimeter-wave radar

The representative of the millimeter-wave radar solution is Project Soli, a radar sensor specially designed by Google that can track sub-millimeter-accurate high-speed motion, but it is still at the laboratory stage.


Judging from its published demos, it can currently recognize individual gestures and identify small, precise gestures at close range, which suits the fine motor skills of the human hand well. The downside is that the effective range is too small to capture all the degrees of freedom of the hand.


▲ The picture shows a demonstration of Project Soli

3. Data gloves

A data glove is a special glove worn on the hand with built-in sensors. The sensors measure the angles or positions of the fingers, and the pose of the hand is then calculated according to inverse kinematics.

Commonly used sensors include bending sensors, angle sensors, magnetic sensors, and so on.

Bending sensors and angle sensors both measure roughly the same thing: how much each finger bends. Take the DEXMO force-feedback glove as an example. The glove uses rotation sensors to capture 11 degrees of freedom of hand movement, including the stretching and bending of each finger, plus an extra rotation of the thumb.
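To give a feel for how per-joint sensor readings turn into hand geometry, here is a toy planar forward-kinematics sketch (the direction from measured joint angles to joint positions). The segment lengths and angles are made-up illustrative numbers, not DEXMO’s actual hand model.

```python
import numpy as np

def fingertip_position_2d(joint_angles, segment_lengths):
    """Planar forward kinematics for a single finger: accumulate joint angles
    along the finger segments to get the fingertip position relative to the
    knuckle. Purely illustrative; a real glove model works in 3D."""
    x, y, theta = 0.0, 0.0, 0.0
    for angle, length in zip(joint_angles, segment_lengths):
        theta += angle                      # each joint bends relative to the previous segment
        x += length * np.cos(theta)
        y += length * np.sin(theta)
    return x, y

# Example: three finger segments (in cm), each joint bent by 20 degrees
print(fingertip_position_2d(np.radians([20.0, 20.0, 20.0]), [4.0, 2.5, 2.0]))
```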


This solution detects local hand motion accurately and is not limited by the field-of-view constraints of visual solutions. The disadvantages are that wearing a glove is inconvenient, only local finger movements can be detected, and the position and orientation of the whole hand cannot be located. To detect the position of the hand, DEXMO needs to be used together with another 6-degree-of-freedom tracker.

Of course, DEXMO’s biggest selling point is not gesture recognition but realistic haptics combined with gesture recognition. Gesture recognition plus haptic feedback will certainly be an important part of human-computer interaction in the future. UltraHaptics, which recently acquired Leap Motion, is itself a company that makes haptic feedback.

There are also magnetic sensors, for example the trakSTAR electromagnetic position-tracking system. The position and orientation of magnetic sensors attached to the hand are determined from changes in the magnetic field, and the specific pose of the hand is then determined according to inverse kinematics.


▲ The picture shows the use of trakSTAR

This solution requires 6 magnetic sensors on the hand (5 fingertips + 1 on the back of the hand) and a magnetic transmitter placed in front. The magnetic transmitter creates a special electromagnetic field within a certain range; from the differences in field strength detected by the sensors at different positions within this field, the positions and orientations of the fingertips and palm are inferred, and the positions of all hand joints are then determined by inverse kinematics.

The disadvantages of this solution are that the effective range is too small, the price is too high, and the application scenarios are too narrow. The advantages are high precision, good stability, and coverage of all degrees of freedom of the hand.

At present, this solution is used only in pure scientific research. Several gesture datasets recently published by academia, such as FHAB and BigHand, were collected with this device.


▲ The picture shows a schematic of the FHAB data set

Two types of algorithm models for gesture recognition

With the overview above, you should now have a preliminary understanding of the hardware solutions for gesture recognition. But hardware alone is not enough for gesture interaction; professional algorithms are also needed.

Once we get a depth map from the camera, the next step is to feed the depth map into an algorithm that outputs the 3D positions of all the keypoints of the hand.

Hand keypoints can be understood as the joints of the hand skeleton, usually described by 21 3D keypoints. Each 3D keypoint has 3 degrees of freedom, so the output dimension is 21*3, and we often describe it with a 21*3-dimensional vector, as shown below:


▲ The picture shows the 21 3D hand keypoints, visualized
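As a concrete illustration of this representation (a minimal sketch, not code from the article; the joint ordering follows whatever convention a given dataset uses), the pose is simply a 21×3 array that can be flattened into a 63-dimensional vector:

```python
import numpy as np

NUM_JOINTS = 21  # wrist plus finger joints; the exact ordering is dataset-specific

hand_pose = np.zeros((NUM_JOINTS, 3), dtype=np.float32)  # one (x, y, z) row per joint, e.g. in millimeters
pose_vector = hand_pose.reshape(-1)                       # the 21*3 = 63-dimensional vector described above
assert pose_vector.shape == (63,)
```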

Academia has proposed various algorithms to solve this depth-based hand pose estimation problem. These algorithms can be roughly divided into two kinds of methods: model-driven and data-driven.

1. Model-driven algorithms

This type of algorithm usually generates a series of hand geometric models from preset poses (here “pose” refers to the pose parameters or joint positions; below we will simply call it the pose) and builds a search space (the set of all possible hand geometric models), and then searches that space for the model that best matches the input depth map.

The parameters corresponding to that model are then the pose we are looking for. Such algorithms belong to generative approaches. We take the following paper as an example:


Model-driven algorithms usually need to design a way to convert a pose into the corresponding geometric model.

This paper uses linear blend skinning (a skeletal skinning animation algorithm), which, intuitively, casts a layer of skin over the skeleton and lets the skin deform along with the bones’ movement. It is mostly used in the animation field.

The pose is first converted into the corresponding mesh (left side of the picture below) and then further converted into a smooth surface model (right side of the picture below). We can think of the pose as the independent variable: the geometric model can be computed from the pose, so each geometric model corresponds to a pose.


▲ The picture shows the hand geometry model

The input depth map of the hand can be converted into a point cloud, which is equivalent to a set of 3D points sampled on the real hand surface, such as the red and blue points in the picture below:

▲ The picture shows the hand point cloud (red and blue points) and its distances to the model surface

This lets us define the loss function as the distance from points in the point cloud to the surface of the model (the red lines in the picture above), which describes the similarity between the depth map and a pose. The inputs of the loss function are the depth map and the pose, and the output is their degree of difference: the smaller the output value, the more similar the input depth map and pose are.

Therefore, as long as we find the pose in the search space that minimizes the loss function, that is the pose we are looking for. However, because the search space cannot be written in an analytical form, the minimum of the loss function cannot be obtained in one step; usually only numerical methods such as PSO and ICP can be used to iterate toward the optimal solution.

The figure above shows the pose from the start of the iteration to the end, from left to right. This kind of iterative numerical solution usually has high requirements on initialization: if the initialization is poor, it takes a long time to converge, and it may even fail to converge to the global minimum (because the loss function is non-convex). So in practice, the pose of the previous frame is usually used to initialize the computation for the current frame.
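The idea can be condensed into a small sketch (assumptions of our own: the hand model is reduced to its 21 joint positions rather than a skinned mesh, and a naive random search stands in for PSO/ICP; none of this is the paper’s actual implementation):

```python
import numpy as np

def pose_to_surface_points(pose):
    """Stand-in for the hand geometric model. A real system would skin a mesh
    from the pose; here the 21 joint positions themselves play the role of the
    'surface' so that the sketch runs end to end."""
    return pose.reshape(21, 3)

def loss(pose, point_cloud):
    """Mean distance from each observed 3D point to its nearest model point,
    approximating the point-to-surface distance described above."""
    surface = pose_to_surface_points(pose)                                    # (M, 3)
    dists = np.linalg.norm(point_cloud[:, None, :] - surface[None, :, :], axis=-1)
    return dists.min(axis=1).mean()

def fit_pose(point_cloud, init_pose, iters=200, step=1.0):
    """Crude derivative-free search standing in for PSO/ICP: start from the
    previous frame's pose and keep random perturbations that lower the loss."""
    pose, best = init_pose.copy(), loss(init_pose, point_cloud)
    for _ in range(iters):
        candidate = pose + np.random.normal(scale=step, size=pose.shape)
        c = loss(candidate, point_cloud)
        if c < best:
            pose, best = candidate, c
    return pose

# Usage sketch: current_pose = fit_pose(point_cloud, previous_frame_pose)
```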

This model-driven kind of approach requires manually designing the geometric model and the loss function. A simple geometric model is cheap to compute but less accurate, while a complex geometric model is accurate but computationally expensive, so the design of the model is often a trade-off between accuracy and performance.


▲ The picture shows different hand geometry models

The advantage of model-driven algorithms is that they do not require any training data: as long as the design is good, they can be used directly once implemented. The disadvantages are the need to manually design the model, the large amount of computation, the easy accumulation of errors leading to drift, and the high requirements on initialization, so they are usually used only in the field of gesture tracking.

2. Data-driven algorithms

This type of algorithm uses the relationship between the training samples in collected data and their corresponding labels, letting the machine learn a mapping from sample to label. Such algorithms belong to discriminative approaches.

There are many such machine learning algorithms, for example the random forests and SVMs used in the early days, or the neural networks that are the hot topic of recent research.

The advantage of this approach is that you do not need to design complex models; the downside is that you need a lot of data. But since data volume is no longer a problem in the big-data era, this data-driven approach has become the mainstream research direction.

Classic early academic methods for keypoint regression include cascaded regression and the Latent Regression Forest. In recent years, research has focused on various neural networks such as the DeepPrior series, REN, pose guided, 3D-CNN, Multi-View CNNs, HandPointNet, and Feedback Loop.

Because the neural networks used for gestures discussed here are not fundamentally different from ordinary image neural networks, and there are already many popular-science articles on neural networks, we will not cover the basics here. We only pick a few representative network structures to introduce:

DeepPrior: the network structure is roughly as shown below. An initial network produces a rough pose, which is then continuously optimized by refine networks, and a low-dimensional embedding is added before the final fully connected layer, forcing the network to learn to compress the feature space to a lower dimension. This network was later followed by a more optimized version, DeepPrior++.
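To make the bottleneck idea concrete, here is a minimal PyTorch sketch of a DeepPrior-style regressor. The layer sizes, input resolution, and the absence of the refine stages are simplifications of our own, not the paper’s exact architecture:

```python
import torch
import torch.nn as nn

class DeepPriorStyleNet(nn.Module):
    """Sketch of the DeepPrior bottleneck idea: convolutional features are
    forced through a low-dimensional embedding before regressing the
    21*3 joint coordinates. All sizes here are illustrative assumptions."""
    def __init__(self, embed_dim=30, num_joints=21):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(8, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
        )
        self.regressor = nn.Sequential(
            nn.Linear(16 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),             # low-dimensional "prior" bottleneck
            nn.Linear(embed_dim, num_joints * 3),  # final joint regression
        )

    def forward(self, depth):
        return self.regressor(self.features(depth))

# Example: a batch of 96x96 depth crops -> (batch, 63) joint coordinates
out = DeepPriorStyleNet()(torch.zeros(2, 1, 96, 96))
```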


Feedback Loop: the network structure is as shown below. After the network predicts a pose, it uses the pose to generate a synthetic depth map, and together with the input depth map predicts a better pose. This better pose can then be used to generate a better depth map, and the pose is optimized iteratively in this loop.
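The loop itself can be sketched as follows, assuming three hypothetical networks: a `predictor` (depth map to pose), a `synthesizer` (pose to synthetic depth map), and an `updater` that outputs a pose correction from the observed and synthetic depth maps; these names are ours, not the paper’s:

```python
def feedback_loop(depth, predictor, synthesizer, updater, iters=3):
    """Sketch of the iterative refinement described above: predict a pose,
    render a synthetic depth map from it, then let an update network compare
    the synthetic and observed depth maps and output a pose correction."""
    pose = predictor(depth)
    for _ in range(iters):
        synthetic = synthesizer(pose)                    # pose -> generated depth map
        pose = pose + updater(depth, synthetic, pose)    # correction from the comparison
    return pose
```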


3D CNN: the network structure is as shown below. This network converts the depth information described by the pixels of the 2D depth map into voxels (3D pixels) using TSDF, and replaces ordinary 2D convolutions with 3D convolutions.

The biggest contribution here is moving the network structure from 2D to 3D: traditional 2D convolutional networks are designed for 2D images and are not necessarily suited to extracting 3D information, whereas 3D convolutional networks obtain 3D features more easily and are therefore better suited to regressing the 3D hand keypoints.
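As a rough sketch of the voxelization step (a simplified binary occupancy grid rather than a true TSDF; the camera intrinsics fx, fy, cx, cy, the grid size, and the volume extent are all assumed values):

```python
import numpy as np

def depth_to_occupancy(depth, fx, fy, cx, cy, grid=32, size=300.0):
    """Back-project valid depth pixels to 3D points using the camera
    intrinsics, then mark occupied cells in a grid x grid x grid volume of
    side `size` mm centered on the hand. A real TSDF stores signed
    distances rather than 0/1 occupancy."""
    v, u = np.nonzero(depth)                        # pixel coordinates with valid depth
    z = depth[v, u].astype(np.float32)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=1)
    pts -= pts.mean(axis=0)                         # roughly center the volume on the hand
    idx = np.clip(((pts / size + 0.5) * grid).astype(int), 0, grid - 1)
    volume = np.zeros((grid, grid, grid), dtype=np.float32)
    volume[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return volume                                    # input for a 3D CNN
```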

HandPointNet: the network structure is as shown below. The input depth map is first converted into a point cloud, and then PointNet is used to regress the 3D keypoints.

HandPointNet’s main contribution is being the first to use PointNet for hand keypoint regression. PointNet itself is a representative network: it was the first to use 3D point clouds, rather than 2D pictures, as the network input.

PointNet explores neural network architectures in 3D space and how to extract 3D features more efficiently than the earlier 3D CNNs. PointNet also has a more optimized version, PointNet++.
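The core PointNet idea, a shared per-point MLP followed by an order-invariant pooling, can be sketched in a few lines of PyTorch. The layer sizes and the direct regression head are simplifications of our own, not HandPointNet’s actual architecture:

```python
import torch
import torch.nn as nn

class MiniPointNet(nn.Module):
    """Toy version of the PointNet idea: a shared per-point MLP followed by a
    symmetric max-pool, so the global feature does not depend on the order of
    the points, then a head regressing the 21*3 joint coordinates."""
    def __init__(self, num_joints=21):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
        )
        self.head = nn.Linear(128, num_joints * 3)

    def forward(self, points):                  # points: (batch, N, 3)
        feats = self.point_mlp(points)          # per-point features
        global_feat = feats.max(dim=1).values   # order-invariant pooling
        return self.head(global_feat)

# Example: 1024 points sampled from the hand's depth map -> (batch, 63) joints
out = MiniPointNet()(torch.zeros(2, 1024, 3))
```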


Four application scenarios for gesture recognition

Above we introduced the common hardware solutions and algorithms for gesture recognition. So what are the real application scenarios for gesture recognition? Many people may think gesture interaction is just a concept from science-fiction movies. Next, taking product applications as examples, we introduce some scenarios that have already been commercialized or have potential.

1. VR gestures

Many companies, represented by Leap Motion, are working on VR + gestures. VR emphasizes immersion, and gesture interaction can greatly enhance immersion when using VR. VR + gestures is therefore very promising, and once VR becomes popular it will change the way people entertain themselves.

Representative products: Leap Motion, uSens, Extreme Fish Technology, etc.


2. AR gestures

Many companies, represented by HoloLens, are making AR glasses. AR glasses may do away with the physical touch screen, mouse, and keyboard as input tools, taking images and voice as input instead, and at that point gesture interaction becomes indispensable. But AR is at an even earlier stage than the VR market, and it needs to keep accumulating technology while waiting for the market to mature.

Representative products include HoloLens, Magic Leap, Rokid Glass, Nreal, Project North Star, and Bright Wind.


▲ The photo shows a demo of Leap Motion’s Project North Star

3. Desktop gestures

Projector + gesture recognition, represented by the Sony Xperia Touch: the screen is projected onto any flat surface, and touch-screen operation is then simulated through gesture recognition.

The gesture recognition used here is relatively simple, basically recognizing only single-point and multi-point touches. However, the hand in between easily blocks the picture displayed by the projector, and there is also the problem of display clarity. This scenario may be more of an exploration, and the chance of it landing commercially is small.


But we can let our imagination run: if the planar gesture recognition here is replaced with in-air gesture recognition, and the planar projection is replaced with holographic 3D projection, then we can realize the holographic console from “Iron Man” mentioned at the beginning of the article.

In-air gesture recognition already exists, but there is no true holographic projection technology yet, only some pseudo-holographic projections, such as reflective and fan-type pseudo-holographic projection.

Reflective pseudo-holographic projection simply reflects the image of an object off a reflective panel (a plastic plate) to form a virtual image; because the panel is transparent, the image appears to be formed directly in the air. Fan-type pseudo-holographic projection uses the persistence of vision of the human eye to make the picture appear to be imaged directly in the air.


▲ The above picture shows a reflective pseudo-holographic projection


▲ The picture above shows a fan-type pseudo-holographic projection

The biggest problem with these pseudo-holographic projections is that the hand cannot interact with the virtual image. To realize the holographic workbench in “Iron Man”, the most feasible way is to do it in AR glasses: as long as the computed gesture pose is aligned with the display device, interaction between the hand and the virtual image can be achieved.

Representative products include: Xperia Touch, Light Shadow Magic Screen, etc.

4. In-car gestures

In-car gestures refer to using gestures while driving to interact with some of the controls on the center console. Compared with the traditional way, the advantage of gestures is that you do not have to press buttons or poke the screen every time, which is smarter and more convenient.

When using a touch screen, the driver needs to look at the screen to know where a button is, and that act of watching the screen carries a great safety risk. With gestures, operation can be done directly, with voice feedback, without having to stare at the screen.

In-car gestures can improve driving safety to a certain extent, but they also have shortcomings: making gestures in the air gets tiring, and given the accuracy and latency of gesture recognition, it is far less convenient than simply turning a knob or tapping the screen by hand. Therefore, the industry currently relies on the traditional way, with gesture operation as an assistance.

Representative products: the BMW 7 Series, Bayern Car, Junma SEEK 5, etc.


Summary

In the AI era, the addition of interaction methods such as speech recognition and gesture recognition allows us to interact with machines in more ways.

Voice interaction has a first-mover advantage in the artificial intelligence era; it is gradually landing in products and is expected to be applied at scale. Judging from the scenarios where gesture recognition has landed so far, this form of interaction is still at an early stage of the industry.

But it is foreseeable that gesture interaction will be an essential part of future human-computer interaction.

In your imagination, what other scenarios could use gesture interaction? You are welcome to leave a message and discuss.

Author: Zhang Zhaohui graduated from the College of Science (Physics) at Zhejiang University. His main research interests include gesture recognition, pose estimation, stereo vision, and deep learning. He now works at Rokid R-Lab as an image algorithm engineer, responsible for the development of gesture algorithms.