This article is from the WeChat public account Fresh Jujube Classroom (ID: xzclasscom). Author: Xiaozaojun. Title image from: Visual China. Original title: "Video Coding Basics from Zero".

The era we live in is the era of the mobile Internet, and it could just as well be called the age of video.

From Kuaibo to Douyin, from "Three Lives Three Worlds" to "Story of Yanxi Palace", our lives are shaped by more and more video.

All of this is inseparable from the continuous upgrading of video capture technology and the growing power of the video production industry.

It is also inseparable from the rapid advancement of communication technology. Imagine if you were still on a 56K dial-up modem or a 2G mobile phone: could you enjoy today's 1080P or even 4K video experience?

Beyond better capture tools and faster networks, another important reason we can enjoy the convenience and fun of video is the rapid advancement of video coding technology.

Today, I will give you a from-scratch introduction to it.

Image Basics

Before we talk about video, let's talk about images.

As everyone knows, an image is made up of many colored dots. These dots are "pixels".

The English word for a pixel is Pixel (abbreviated PX), a blend of the words Picture and Element.

The movie “Pixels”, 2015

Pixels are the basic unit of image display. When we talk about the size of a picture, for example 1920×1080, we mean it is 1920 pixels wide and 1080 pixels high. The product is 2,073,600, which means the picture has about two million pixels.

1920×1080 is also known as the resolution of the picture.

Resolution is also an important specification of a display

So what is the PPI we often hear about?

PPI stands for "Pixels Per Inch", the number of pixels per inch. That is, how many pixels fit along each inch of a phone (or monitor) screen.

For this value, of course, the higher the better! The higher the PPI, the sharper and more detailed the image.

Earlier feature phones, such as Nokias, had very low screen PPIs, so the picture looked noticeably grainy.

Later, Apple introduced the unprecedented "Retina" screen, with a PPI of 326 (326 pixels per inch of screen); the picture is sharp, with no visible grain.
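As a rough sketch of how PPI is computed (using the commonly quoted iPhone 4 figures of 960×640 pixels on a roughly 3.5-inch diagonal; the exact diagonal may differ slightly, which is why the result doesn't land exactly on 326):

```python
import math

def ppi(width_px, height_px, diagonal_inches):
    # Pixels per inch = length of the screen diagonal in pixels / diagonal in inches
    diagonal_px = math.sqrt(width_px ** 2 + height_px ** 2)
    return diagonal_px / diagonal_inches

# iPhone 4 "Retina" display: 960 x 640 pixels on an (approximately) 3.5-inch diagonal
print(round(ppi(960, 640, 3.5)))  # ~330, close to the commonly quoted 326 PPI
```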

Pixels must have colors to make a colorful picture. So, how should this color be represented?

Everyone knows that there are countless colors in our lives.

The lipstick shade numbers that girls talk about are alone enough to leave us dazed.

In a computer system, we can't describe colors with words; even if that didn't drive us crazy, it would drive the computer crazy. In the digital age, colors are of course expressed with numbers.

This brings up the concept of “color component digitization“.

As we learned in art class, any color can be produced by mixing red (Red), green (Green), and blue (Blue) in certain proportions. These three colors are called the "three primary colors".

In the computer, R, G, and B are also referred to as "primary color components". Their values range from 0 to 255, 256 levels in total (256 = 2 to the 8th power).

Therefore, any color can be represented by a combination of three values ​​of R, G, and B.

RGB=[183,67,21]

How many colors can be expressed this way? 256 × 256 × 256 = 16,777,216, often referred to as 16 million colors. With 8 bits for each of the three RGB channels, this is also known as 24-bit color (it occupies 24 bits).
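A quick sketch of these numbers in Python (the pack_rgb helper is just an illustration of how the three 8-bit values fit into 24 bits):

```python
def pack_rgb(r, g, b):
    # Pack three 8-bit components into one 24-bit integer (illustrative helper)
    return (r << 16) | (g << 8) | b

print(256 ** 3)                     # 16777216 -> the "16 million colors"
print(hex(pack_rgb(183, 67, 21)))   # 0xb74315 -> the sample color [183, 67, 21]
```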

This already exceeds the number of colors the human eye can distinguish, which is why it is also called "true color". Going any higher would be largely meaningless to our eyes, which could not tell the difference.

Video Coding Basics

Okay, that covers images. Now let's start talking about video.

As for video: we've all watched cartoons since childhood, so we know where video comes from. That's right: a rapid succession of many pictures is a video.

What parameters are used to measure video?

The most important one is frame rate (Frame Rate).

In a video, a frame (Frame) refers to one still picture. The frame rate refers to the number of pictures the video contains per second (FPS, frames per second).

The higher the frame rate, the more realistic and smooth the video will be.

Once you have video, two problems arise: one is storage, the other is transmission.

The key reason video coding exists is this: an unencoded video is enormous.

Take a video with a resolution of 1920×1080 and a frame rate of 30 as an example.

1920×1080 = 2,073,600 (pixels)

Each pixel is 24 bit (as calculated earlier)

So each picture is 2,073,600 × 24 = 49,766,400 bit

8 bit = 1 byte, so 49,766,400 bit = 6,220,800 byte ≈ 6.22 MB.

This is the raw size of a single 1920×1080 image. Multiply by a frame rate of 30 and the video is 186.6 MB per second, about 11 GB per minute; a 90-minute movie would be about 1,000 GB...
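Here is a small script that reproduces the arithmetic above (decimal units: 1 MB = 10^6 bytes, 1 GB = 10^9 bytes):

```python
# Raw size of uncompressed 1920x1080, 24-bit, 30 fps video
width, height, bits_per_pixel, fps = 1920, 1080, 24, 30

bytes_per_frame = width * height * bits_per_pixel / 8   # 6,220,800 bytes per picture
mb_per_second = bytes_per_frame * fps / 1e6              # ~186.6 MB per second
gb_per_minute = mb_per_second * 60 / 1e3                 # ~11.2 GB per minute
gb_per_movie = gb_per_minute * 90                        # ~1000 GB for a 90-minute film

print(round(bytes_per_frame), round(mb_per_second, 1),
      round(gb_per_minute, 1), round(gb_per_movie))
```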

Scary, isn't it? Even if your computer's hard drive is 4 TB (about 3,600 GB of usable space), it could only hold three or four such movies!

And video must not only be stored but also transmitted; otherwise, where would it come from in the first place?

At a 100 Mbps network speed (12.5 MB/s), downloading one such movie would take about 22 hours... painful.

Because of this, engineers decided that video must be encoded.

What is encoding?

Encoding is the conversion of information from one form (format) to another form (format).

Video encoding is the conversion of one video format to another.

The ultimate goal of coding, to put it bluntly, is to compress.

A variety of video coding methods are used to make the video smaller and more convenient for storage and transmission.

Let's take a look at the whole process a video goes through, from recording to playback:

The first step is video capture. We usually use video cameras or camera modules for capture. Due to space limitations, I won't explain the principles of CCD imaging here.

After the video data is acquired, analog-to-digital conversion is performed to turn the analog signal into a digital signal. In fact, many cameras today output digital signals directly.

After the signal is output, pre-processing is also performed to change the RGB signal into a YUV signal.

We introduced the RGB signal earlier, but what is a YUV signal?

In a nutshell, YUV is another way to digitally represent colors.

The reason why video communication systems use YUV instead of RGB is mainly because RGB signals are not conducive to compression.

The YUV model separates out the concept of brightness (luminance).

Video engineers discovered long ago that the human eye resolves differences in brightness more finely than differences in color; in other words, the eye is less sensitive to chrominance than to luminance.

So engineers realized that a video system does not need to store the full color signal. We can give more bandwidth to the black-and-white signal (called "luma") and less bandwidth to the color signal (called "chroma"). Hence YUV.

The "Y" in YUV is the brightness (luma), and the "U" and "V" are the color information (chroma).

The Y'CbCr you will occasionally see is essentially a scaled and offset digital version of YUV. The difference is that Y'CbCr is used for digital images, while YUV is used for analog signals. The "YUV" that MPEG, DVDs, and camcorders talk about is actually Y'CbCr.
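To make this concrete, here is a rough sketch of the RGB → Y'CbCr conversion using the commonly cited full-range BT.601 coefficients (an assumption on my part; real systems may use BT.709 or limited-range variants):

```python
def rgb_to_ycbcr(r, g, b):
    # Approximate full-range BT.601 conversion; other standards use different coefficients
    y  =  0.299 * r + 0.587 * g + 0.114 * b          # luma (brightness)
    cb = -0.169 * r - 0.331 * g + 0.500 * b + 128    # blue-difference chroma
    cr =  0.500 * r - 0.419 * g - 0.081 * b + 128    # red-difference chroma
    return round(y), round(cb), round(cr)

print(rgb_to_ycbcr(183, 67, 21))   # the sample color from earlier; Y carries the brightness
```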

How YUV(Y’CbCr) forms an image

The storage format of a YUV stream is closely related to the way it is sampled. (Sampling here means how the chroma data is captured.)

There are three main sampling methods, YUV4:4:4, YUV4:2:2, and YUV4:2:0.

The details are a bit tedious. Just remember that the commonly used YUV 4:2:0 sampling halves the amount of data.
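A tiny sketch of why 4:2:0 halves the data, assuming 8 bits per sample:

```python
# Bits per pixel under different chroma subsampling schemes (8 bits per sample):
# 4:4:4 keeps every U and V sample, 4:2:2 halves them horizontally,
# 4:2:0 halves them both horizontally and vertically.
def bits_per_pixel(scheme):
    chroma_samples_per_pixel = {"4:4:4": 2, "4:2:2": 1, "4:2:0": 0.5}[scheme]
    return 8 + 8 * chroma_samples_per_pixel   # luma + chroma

for s in ("4:4:4", "4:2:2", "4:2:0"):
    print(s, bits_per_pixel(s))   # 24, 16, 12 -> 4:2:0 needs half the data of 4:4:4
```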

Once this pre-processing is done, the actual encoding begins.

How Video Encoding Works

As mentioned earlier, encoding is about compression. To compress, various algorithms are designed to remove redundant information from the video data.

When you look at a picture or a video, think about it: if it were up to you, how would you compress it?

(For a picture of a goddess, I wouldn't want to compress it one bit...)

I think the first thing that comes to mind is to look for patterns.

Yes: look for correlations between pixels within a picture, and correlations between frames at different points in time.

For example, if a picture (1920×1080 resolution) is entirely red, do I really need to say [255,0,0] 2,073,600 times? I only need to say [255,0,0] once, and then say "same as above" 2,073,599 times.
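Run-length encoding is the simplest way to exploit this kind of repetition. A toy sketch (real codecs use far more sophisticated prediction and entropy coding):

```python
def run_length_encode(pixels):
    # Store each value once together with how many times it repeats in a row
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1
        else:
            runs.append([p, 1])
    return runs

red_picture = [(255, 0, 0)] * (1920 * 1080)
print(run_length_encode(red_picture))   # [[(255, 0, 0), 2073600]] - one entry, not two million
```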

Similarly, if a one-minute video has ten seconds where the image doesn't move, or 80% of the image area stays unchanged throughout, can't that storage overhead be saved?

Yes. An encoding algorithm is essentially about finding patterns and building models. Whoever finds more precise patterns and builds more efficient models has the more powerful algorithm.

Generally speaking, a video contains several kinds of redundant information. The targets that video coding technology prioritizes eliminating are spatial redundancy and temporal redundancy.

Next, Xiaozaojun will explain what methods are used to eliminate them.

The following content is a bit dense, but with a little patience I believe you can follow it.

A video is formed by playing a series of frame pictures in continuous succession.

These frames are mainly divided into three categories, namely I frame, B frame, and P frame.

An I frame is a self-contained frame carrying all of its own information; it is the most complete picture (and takes up the most space) and can be decoded independently, without referring to any other frame. The first frame of a video sequence is always an I frame.

A P frame ("inter-predictive coded frame") encodes only the differences from a previous I frame and/or P frame, so it depends on those earlier reference frames. In exchange, a P frame compresses well and takes up much less space.

P Frame

A B frame ("bidirectionally predictive coded frame") uses both the previous and the following frames as references. Because it references frames in both directions, it achieves the highest compression ratio, up to 200:1. However, since it depends on later frames, it is not well suited to real-time transmission (e.g., video conferencing).

B Frame

By classifying frames this way, video size can be compressed dramatically. After all, the amount of data to process shrinks significantly (from whole images down to the changed regions within them).
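To see how the frame types interact, here is a toy sketch of a hypothetical group of pictures (an example pattern, not from the original article): because B frames reference a later I/P frame, that frame must be decoded before them, so decode order differs from display order (real encoders choose these patterns adaptively):

```python
# Hypothetical "I B B P B B P" group of pictures: display order vs. decode order.
# The P frame that a pair of B frames references must be decoded before them,
# even though it is displayed after them.
display_order = ["I", "B", "B", "P", "B", "B", "P"]
decode_order  = ["I", "P", "B", "B", "P", "B", "B"]

for i, (shown, decoded) in enumerate(zip(display_order, decode_order), start=1):
    print(f"slot {i}: displayed {shown}, decoded {decoded}")
```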

If you capture packets from a video stream, you can also see the I frame information, as shown below:

Let’s take a look at an example.

This has two frames:

They look the same, don't they?

Not quite. Viewed as a GIF animation, you can see they are different:

The person is moving, the background is not moving.

The first frame is an I frame and the second frame is a P frame. The difference between the two frames is as follows:

In other words, some of the pixels in the picture have moved. The motion trajectories are as follows:

This is motion estimation and compensation.

Of course, tracking motion pixel by pixel would itself generate a lot of data, so the image is generally divided into "blocks (Block)" or "macroblocks (MacroBlock)" and the calculation is done per block. A macroblock is typically 16×16 pixels.
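As a rough illustration of block matching, the toy function below does an exhaustive search for the position in a reference frame that best matches a 16×16 macroblock, using the sum of absolute differences (SAD) as the cost; the frame contents and search window here are made up for the example, and real encoders use much faster search strategies:

```python
import numpy as np

def best_match(ref_frame, cur_block, top, left, search=8):
    # Exhaustive (full-search) block matching over a +/- 'search' pixel window:
    # return the motion vector that minimises the sum of absolute differences (SAD).
    h, w = ref_frame.shape
    best_sad, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if 0 <= y and y + 16 <= h and 0 <= x and x + 16 <= w:
                cand = ref_frame[y:y + 16, x:x + 16].astype(int)
                sad = np.abs(cand - cur_block.astype(int)).sum()
                if best_sad is None or sad < best_sad:
                    best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad

ref = np.random.randint(0, 256, (64, 64), dtype=np.uint8)   # reference frame (random content)
cur = ref[10:26, 20:36]                                      # a 16x16 block that "moved" to (12, 22)
print(best_match(ref, cur, top=12, left=22))                 # should recover motion vector (-2, -2), SAD 0
```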

Cutting an image into macroblocks

Okay, let me sort it out.

I frames are processed with intra-frame coding, which exploits only the spatial correlation within that single image.

P frames are processed with inter-frame coding (forward motion estimation), exploiting both spatial and temporal correlation. Simply put, a motion estimation and compensation algorithm removes the redundant information.

Note that although I frames (intra coding) use only spatial correlation, the encoding process is still far from simple.

As the figure above shows, intra-frame coding goes through several stages such as DCT (discrete cosine transform), quantization, and coding. They are fairly involved, so for reasons of space I won't explain them in detail today.
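Just to give a flavor of the DCT step, here is a minimal sketch using SciPy: a smooth 8×8 block is transformed and then coarsely quantized, leaving only a handful of non-zero coefficients (the quantization step of 16 is an arbitrary choice for illustration, not a value from any standard):

```python
import numpy as np
from scipy.fftpack import dct

def dct2(block):
    # 2-D type-II DCT: transform along columns, then along rows
    return dct(dct(block, norm='ortho', axis=0), norm='ortho', axis=1)

# A smooth 8x8 block (a horizontal brightness ramp): after the DCT almost all the
# energy sits in a few low-frequency coefficients, and coarse quantisation zeroes out the rest.
block = np.tile(np.linspace(100, 130, 8), (8, 1))
quantised = np.round(dct2(block) / 16).astype(int)
print(np.count_nonzero(quantised), "non-zero coefficients out of 64")
```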

So after a video has been encoded and then decoded, how do we measure and evaluate the quality of the codec?

In general, it is divided into objective evaluation and subjective evaluation.

Objective evaluation lets the numbers speak, for example by calculating the signal-to-noise ratio (SNR) or the peak signal-to-noise ratio (PSNR).

Those of you with a communications background should be familiar with this concept.

I won't walk through the signal-to-noise ratio calculation here; I'll just throw out the formula, and you can study it yourself if you have time...
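For reference, PSNR is usually computed as 10·log10(MAX² / MSE), where MAX is the peak pixel value (255 for 8-bit images) and MSE is the mean squared error between the original and decoded pictures. A minimal sketch with made-up test data:

```python
import numpy as np

def psnr(original, decoded, peak=255.0):
    # PSNR = 10 * log10(peak^2 / MSE), in decibels; higher means closer to the original
    mse = np.mean((original.astype(float) - decoded.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(peak ** 2 / mse)

orig = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)
noisy = np.clip(orig + np.random.normal(0, 5, orig.shape), 0, 255).astype(np.uint8)
print(round(psnr(orig, noisy), 1))   # roughly 34 dB for noise with standard deviation 5
```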

In addition to objective evaluation, it is subjective evaluation.

Subjective evaluation uses people's own perception to judge directly; in plain words, "whether it looks good is up to the viewer."

International Standards for Video Coding

Next, let's talk about standards (Standard).

Every technology has standards. Since video coding appeared, many video coding standards have been born.

Before going into the video coding standards themselves, let me first introduce the organizations that set them.

First, the famous ITU (International Telecommunication Union).

The ITU is a specialized agency of the United Nations and is headquartered in Geneva, Switzerland.

The ITU has three divisions, ITU-R (formerly known as the International Radio Consultative Committee CCIR), ITU-T (formerly the International Telegraph and Telephone Consultative Committee CCITT), ITU-D.

In addition to the ITU, the other two organizations closely related to video coding are ISO/IEC.

Everyone knows ISO: it is the "International Organization for Standardization" behind the ISO 9001 quality certification. The IEC is the "International Electrotechnical Commission".

In 1988, ISO and IEC jointly established an expert group to develop standards for encoding, decoding, and synchronizing television image data and sound data. This expert group is the famous MPEG, the Moving Picture Experts Group.

For more than 30 years, the world's mainstream video coding standards have basically come from these bodies.

The ITU proposed H.261, H.262, H.263, H.263+, and H.263++, collectively referred to as the H.26X series, mainly used for real-time video communication such as video conferencing and videophones.

ISO/IEC proposed MPEG-1, MPEG-2, MPEG-4, MPEG-7, and MPEG-21, collectively referred to as the MPEG series.

At first the ITU and ISO/IEC each went their own way. Later, the two sides formed a joint group called the JVT (Joint Video Team).

JVT is committed to the development of a new generation of video coding standards, and later introduced a series of standards including H.264.

Compression ratio comparison

Development of video coding standards

Pay special attention to HEVC in the picture above: that is H.265, which is now in full swing.

As a new coding standard, it has a great performance improvement over H.264 and is now standard on the latest video coding systems.

Finally, let me talk about encapsulation.

A video with only pictures and no sound certainly won't do. So after the video is encoded, it is combined with the encoded audio and packaged together.

Encapsulation means putting the content into a container format. Simply put, the encoded and compressed video track and audio track are placed into one file according to a certain format. To put it more plainly: the video track is the rice, the audio track is the dish, and the container format is the lunch box that holds them both.

The current main video containers are as follows: MPG, VOB, MP4, 3GP, ASF, RMVB, WMV, MOV, Divx, MKV, FLV, TS/PS, etc.

Once packaged, the video can be transmitted, and a video player can then decode and play it.

Final Words

Okay! After all that, the introduction is finally done...

In fact, the reason why Xiaozaojun has to do