This article comes from the WeChat public account "core stuff" (ID: aichip001). Author: Xinyuan. Header image from NVIDIA.

Two weeks ago, NVIDIA's kitchen press conference drew the attention of the global AI field. Standing in front of a chopping board, co-founder and CEO Jensen Huang unveiled a dense lineup of new hard-core products.

The most powerful GPU, the most powerful AI system, an AI cluster that rivals the world's fastest supercomputers, and edge AI products ranging from embedded modules to edge servers... Only an AI chip heavyweight this deep in resources and technology could throw out so many blockbuster products at once; spread out, they could easily fill ten press conferences.

Behind these powerful AI technologies, the core hero is NVIDIA's eighth-generation GPU architecture: Ampere.

Yesterday, we spoke remotely with Jonah Alben, senior vice president of GPU engineering at NVIDIA, and Paresh Kharya, director of accelerated computing product management at NVIDIA, to deepen our understanding of the full picture of NVIDIA's new Ampere GPU architecture.

Based on the 83-page "NVIDIA A100 Tensor Core GPU Architecture" white paper and our interview, this article distills the key innovations and improvements in the compute and memory hierarchy of the Ampere GPU architecture, and analyzes how the new architecture delivers NVIDIA's biggest performance leap to date.

The A100's innovations and improvements in compute and memory architecture

One. Three years to forge a sword: "Ampere" is unsheathed

Looking back at the evolution of NVIDIA's compute architectures, the iteration cadence of its compute cards has never been fixed.

The evolution of NVIDIA's compute cards

The M40 came two years after the K40, the P100 about half a year after the M40, the V100 one year after the P100, and the A100 three years after the V100.

After holding its breath for three years, NVIDIA's big move is truly extraordinary: the new A100 GPU, AI systems, and AI supercomputers all post outstanding numbers.

The A100 handles training, inference, and data analytics on the same chip, lifting AI training and inference compute to 20 times that of the previous-generation V100 and HPC performance to 2.5 times that of the V100.

NVIDIA A100 GPU

The A100 is built on the Ampere-based GA100 GPU and is highly scalable, providing powerful acceleration for GPU computing and deep learning applications in single-GPU and multi-GPU workstations, servers, clusters, cloud data centers, edge systems, and supercomputers.

The HGX A100 server building block, an integrated baseboard available in multiple GPU configurations, can be combined into very large multi-GPU servers with up to 10 PFLOPS of computing power.

A single DGX A100 node, an AI system with 8 A100 GPUs, delivers 5 PFLOPS of compute and is priced at US$199,000.

The DGX SuperPOD cluster, built from 140 DGX A100 systems, offers 700 PFLOPS of AI compute and ranks among the 20 fastest AI supercomputers in the world.

After adding four DGX SuperPODs, NVIDIA's in-house supercomputer SATURNV grew its total compute from 1.8 ExaFLOPS to 4.6 ExaFLOPS, an increase of more than 155%.
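As a quick check on that percentage, using the article's own figures:

\[
\frac{4.6 - 1.8}{1.8} \approx 1.56, \quad \text{i.e. an increase of roughly } 156\%.
\]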

These limit-pushing performance figures all rest on five key technologies at the heart of NVIDIA's new-generation Ampere architecture.

(1) Ampere architecture: the world's largest 7nm chip, with 54.2 billion transistors, 40GB of Samsung HBM2, and memory bandwidth of nearly 1.6 TB/s.

High-bandwidth HBM2 memory and larger, faster caches feed data to the increased number of CUDA Cores and Tensor Cores.

(2) Third-generation Tensor Cores: faster and more flexible, with the new TF32 precision lifting AI performance by up to 20 times.

(3) Structured sparsity: further doubles AI inference performance.

(4) Multi-Instance GPU: each GPU can be partitioned into up to 7 concurrent instances to optimize GPU utilization.

(5) Third-generation NVLink and NVSwitch: efficient and scalable, with double the bandwidth of the previous generation.

Jensen Huang said this is the first time that both scale-out and scale-up can be achieved on a single platform.

The NVIDIA A100 GPU architecture accelerates not only large, complex workloads but also many smaller ones. It can support the build-out of data centers while enabling fine-grained workload provisioning, higher GPU utilization, and improved TCO.

Two. GA100 architecture: larger memory capacity and higher bandwidth

To squeeze the last bit of performance out of the GPU, what CUDA developers care most about are the SMs and the memory subsystem. The new GA100 architecture diagram shows how the hardware has changed.

GA100 complete architecture

At the top of the diagram is PCIe 4.0, which doubles the bandwidth of PCIe 3.0 and speeds up communication between GPU and CPU. At the bottom are 12 high-speed NVLink connections.

In the middle are the SMs and the L2 cache. Unlike the V100, the L2 cache in the A100 is split into two partitions, which lets it provide twice the bandwidth of the V100.

The rest of the middle section is the compute and scheduling fabric: 8 GPCs, each with 8 TPCs, and each TPC containing two SMs. A full GA100 GPU therefore has 8 x 8 x 2 = 128 SMs. Each SM contains 4 third-generation Tensor Cores, so a full GA100 GPU has 512 Tensor Cores.

The A100 GPU is not a fully enabled GA100 chip: it ships with 108 SMs and 432 Tensor Cores. As yields improve, we may see a more complete GA100-based GPU later. Compared with the Volta and Turing architectures, each Ampere SM delivers twice the compute throughput.

GA100 Streaming Multiprocessor (SM)

Keeping the compute engines fully fed requires a stronger memory system. Six HBM2 memory stacks sit along the left and right edges of the GA100 diagram, and each stack is paired with two 512-bit memory controllers.

The A100 GPU enables 5 of those high-speed HBM2 stacks and 10 memory controllers, for a capacity of 40GB and a memory bandwidth of 1.555 TB/s, roughly 70% higher than the previous generation.

The A100's on-chip storage has also grown, including a 40MB L2 cache, nearly 7 times the size of the previous generation's.

The A100's L2 cache provides 2.3 times the read bandwidth of the V100, so larger datasets and models can be cached and repeatedly accessed at far higher speed than reading and writing HBM2 memory. L2 cache residency controls optimize capacity utilization by letting applications manage which data is kept in or evicted from the cache.

To improve efficiency and scalability, the A100 adds compute data compression, which can save up to 4x DRAM read/write bandwidth, up to 4x L2 read bandwidth, and up to 2x L2 capacity.

In addition, NVIDIA combines the L1 data cache and shared memory into a single memory block, improving memory access performance while simplifying programming and tuning and reducing software complexity.

The combined L1 cache and shared memory capacity per SM is 192 KB, 1.5 times that of the V100.

CUDA 11 also exposes a new asynchronous copy instruction that loads data from global memory directly into shared memory, optionally bypassing the L1 cache and the register file (RF). This significantly improves memory copy performance, makes better use of memory bandwidth, and reduces power consumption.
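As a rough illustration of what this looks like from the CUDA side, the sketch below uses the cooperative-groups memcpy_async API introduced with CUDA 11 to stage a tile of global memory into shared memory; the kernel name, tile size, and buffers are made up for the example, and error handling and launch code are omitted.

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: each block stages TILE floats from global memory
// into shared memory via the asynchronous copy path, then works on the tile.
constexpr int TILE = 1024;

__global__ void scale_tile(const float* __restrict__ src,
                           float* __restrict__ dst,
                           float factor)
{
    __shared__ float tile[TILE];
    cg::thread_block block = cg::this_thread_block();

    // Asynchronous copy: on A100 this can bypass L1 and the register file,
    // moving data directly from global memory into shared memory.
    cg::memcpy_async(block, tile, src + blockIdx.x * TILE, sizeof(float) * TILE);

    // Wait for the asynchronous copies issued by this block to complete
    // (this also synchronizes the block).
    cg::wait(block);

    // Ordinary shared-memory computation on the staged tile.
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        dst[blockIdx.x * TILE + i] = tile[i] * factor;
}
```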

Three. How is AI compute increased 20-fold?

The gains in AI and HPC compute come mainly from the third-generation Tensor Cores in the Ampere architecture.

In addition to FP32 and FP16, NVIDIA's third-generation Tensor Cores accelerate AI and HPC applications by introducing the new TF32 and FP64 precisions, and they also support mixed-precision BF16/FP16 as well as INT8, INT4, and binary.

With the three new capabilities of the third-generation Tensor Cores, the A100's peak single-precision AI training and inference compute reaches 20 times that of the previous generation, and its peak HPC compute reaches 2.5 times.

A100 vs V100 peak performance

1. TF32 and mixed precision BF16/FP16

TensorFloat-32 (TF32) is the new numerical format the NVIDIA A100 uses for matrix math (tensor operations), which sits at the heart of AI and some HPC workloads.

As AI networks and datasets keep growing, so does the demand for compute. Researchers have tried lower-precision math to boost performance, but that usually requires code changes; the new TF32 precision improves performance without touching the task's code.

Like FP32, TF32 has 8 exponent bits and covers the same numeric range; like FP16, it has 10 mantissa bits, a level of precision that exceeds what AI workloads require.

FP32 is currently the most common format for deep learning training and inference, and TF32 behaves much like it: the TF32 Tensor Core takes FP32 inputs, converts them to TF32 internally, performs the operation, and outputs the result in FP32.
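A back-of-the-envelope way to see what TF32 keeps from an FP32 input: FP32 stores 1 sign bit, 8 exponent bits, and 23 mantissa bits, while TF32 keeps the sign, the full 8-bit exponent, and only the top 10 mantissa bits. The host-side sketch below is a simplification (it truncates rather than rounds, and the real conversion happens inside the Tensor Core hardware); it just zeroes the 13 low mantissa bits to mimic the loss of precision.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Approximate TF32 precision by truncating an FP32 value's mantissa from
// 23 bits down to 10 bits (sign and 8-bit exponent are kept unchanged).
// Illustration only: the hardware conversion rounds rather than truncates.
float to_tf32_precision(float x)
{
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFFE000u;          // clear the low 13 mantissa bits
    std::memcpy(&x, &bits, sizeof(bits));
    return x;
}

int main()
{
    float v = 3.14159265f;
    std::printf("fp32: %.8f  tf32-precision: %.8f\n", v, to_tf32_precision(v));
    return 0;
}
```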

Through NVIDIA's libraries, TF32 Tensor Cores raise the A100's peak single-precision training compute to 156 TFLOPS, 10 times the V100's FP32.

For even better performance, the A100 can also run FP16/BF16 automatic mixed precision (AMP) training; with only a few lines of code changed, performance doubles over TF32 to 312 TFLOPS.
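At the library level, opting into TF32 is a matter of selecting a math mode rather than changing data types. The sketch below shows how a cuBLAS 11 caller might allow an ordinary FP32 GEMM to use TF32 Tensor Cores; the function name is invented for the example, error checking is omitted, and the matrices are assumed to be set up elsewhere. (Mixed-precision FP16/BF16 training, by contrast, is normally switched on through a framework's AMP feature rather than hand-written calls like this.)

```cpp
#include <cublas_v2.h>

// Sketch: run a single-precision GEMM while allowing cuBLAS to use TF32
// Tensor Cores on Ampere. A, B, C are assumed to be device pointers to
// column-major FP32 matrices allocated and filled elsewhere.
void sgemm_tf32(cublasHandle_t handle,
                int m, int n, int k,
                const float* A, const float* B, float* C)
{
    const float alpha = 1.0f, beta = 0.0f;

    // Allow FP32 routines to use TF32 Tensor Core math (cuBLAS 11+).
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    // Inputs and outputs stay FP32; only the internal multiply precision changes.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, A, m, B, k,
                &beta, C, m);
}
```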

NVIDIA is working with the open-source communities behind the major AI frameworks to make TF32 the default training mode on A100 GPUs.

In June of this year, developers will be able to get TF32-enabled versions of PyTorch and TensorFlow from NGC, NVIDIA's catalog of GPU-accelerated software.

2. Structured sparsity

Achieving the 20x speedup for A100 TF32 also requires another key feature of the third-generation Tensor Cores: structured sparsity.

Sparsity is nothing new to algorithm engineers: by removing as many unnecessary parameters as possible from a neural network, its computation can be compressed. The hard part is getting higher speed while keeping sufficient accuracy.

The Ampere architecture adds sparse Tensor Cores, which deliver up to 2 times the peak throughput of the matrix multiply-accumulate operations at the heart of deep learning, without sacrificing accuracy.

It is a rare example of sparsity being exploited to optimize dense computation directly in hardware.

The method first trains the network with dense weights, then prunes it with a 2:4 fine-grained structured sparsity pattern, and finally retrains it by repeating the training steps with the same hyperparameters, initial weights, and zero pattern as before.

The compression scheme is constrained to 50% sparsity: each group of 4 adjacent elements may hold at most two non-zero values, and an index structure records which two positions were kept.

Once the weights are compressed this way, the math throughput can effectively double.
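To make the 2:4 pattern concrete, here is a small host-side sketch (not NVIDIA's pruning tool, just an illustration of the constraint) that zeroes the two smallest-magnitude weights in every group of four, producing exactly the kind of sparsity the A100's sparse Tensor Cores can exploit.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustration of 2:4 fine-grained structured sparsity: in every group of
// four consecutive weights, keep the two with the largest magnitude and set
// the other two to zero. Real pruning flows also retrain the network
// afterwards to recover accuracy.
void prune_2_of_4(std::vector<float>& weights)
{
    for (std::size_t g = 0; g + 4 <= weights.size(); g += 4) {
        // Find the indices of the two largest-magnitude values in the group.
        std::size_t keep0 = g, keep1 = g;
        float best0 = -1.0f, best1 = -1.0f;
        for (std::size_t i = g; i < g + 4; ++i) {
            float mag = std::fabs(weights[i]);
            if (mag > best0) {
                best1 = best0; keep1 = keep0;
                best0 = mag;   keep0 = i;
            } else if (mag > best1) {
                best1 = mag;   keep1 = i;
            }
        }
        // Zero everything in the group except the two kept positions.
        for (std::size_t i = g; i < g + 4; ++i)
            if (i != keep0 && i != keep1)
                weights[i] = 0.0f;
    }
}
```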

Why does the ideal performance ceiling double? As shown in the figure below, matrix A is a 16x16 sparse matrix with 50% sparsity following the 2:4 structure, and matrix B is a 16x8 dense matrix only half the size of A.

A standard matrix multiply-accumulate (MMA) operation does not skip the zero values: it computes the result of the full 16x8x16 matrix multiply over N cycles.

With the sparse MMA instruction, only the non-zero elements in each row of matrix A and the matching elements of matrix B enter the computation, turning it into a smaller dense matrix multiply and delivering a 2x speedup.

Evaluated across dozens of neural networks spanning vision, object detection, segmentation, natural language modeling, and translation, the method shows almost no loss in inference accuracy.

With structured sparsity, A100 TF32 Tensor Core deep learning training peaks at 312 TFLOPS, 20 times the V100's FP32 peak of 15.7 TFLOPS.

With structured sparsity, A100 INT8 Tensor Core deep learning inference peaks at 1248 TOPS, 20 times the V100's INT8 peak of 62 TOPS.

3. Double precision FP64

While TF32 mainly accelerates AI, the gain in HPC throughput comes chiefly from adding Tensor Core support for IEEE-compliant FP64 precision.

On the A100, one double-precision matrix multiply-add instruction replaces eight DFMA instructions on the V100, cutting instruction fetches, scheduling overhead, register reads, datapath power, and shared memory read bandwidth.

With IEEE FP64 support, the A100 Tensor Core's peak double-precision compute reaches 19.5 TFLOPS, 2.5 times that of the V100's FP64 DFMA.

Four. Multi-Instance GPU: splitting the A100 into seven

The A100 is the first GPU to build in the elastic computing technology MIG (Multi-Instance GPU).

MIG partitions the GPU physically. Because the A100 has seven usable GPCs, and taking resource scheduling into account, it can be split into up to 7 independent GPU instances.

When the A100 is split into 7 GPU instances, each instance delivers compute roughly on par with a V100, meaning a single A100 can offer the equivalent of 7 V100s' worth of computing resources.

The core value of MIG is that it can provide flexible GPU resources for different types of workloads.

Without MIG, different tasks running on the same GPU compete for the same resources, crowding each other out and preventing multiple tasks from truly running in parallel.

With MIG, different tasks run in parallel on different GPU instances, each with its own dedicated SMs, memory, L2 cache, and bandwidth, yielding predictable performance and maximizing GPU utilization.

This gives workloads stable, reliable quality of service and effective fault isolation: if an application running on one instance crashes, tasks on the other instances are unaffected.

Administrators can also reconfigure MIG instances dynamically, for example running 7 MIG instances for low-throughput inference during the day and merging them into one large instance for AI training at night.

This is especially valuable for cloud service providers with multi-tenant use cases: resource scheduling becomes more flexible, running tasks do not interfere with one another, and the isolation improves security.

And the CUDA programming model does not change: containerized AI models and HPC applications can run directly on a MIG instance through the NVIDIA Container Runtime.

Five. Third-generation interconnect: faster links between GPUs

MIG is the main engine of scale-out; scale-up, in turn, depends on better communication technology, the "highways" between GPUs: NVLink and NVSwitch.

The limited bandwidth of standard PCIe connections is usually the bottleneck in multi-GPU systems, which is why NVLink, a high-speed direct GPU-to-GPU interconnect, was created.

NVLink lets multiple NVIDIA GPUs operate as one giant GPU, providing efficient performance scaling within a server. The GPU-to-GPU bandwidth the A100 gets over NVLink far exceeds what PCIe offers.

The A100 has 12 third-generation NVLink links, and each differential signal pair runs at up to 50 Gb/s, almost twice the rate of the V100.

Each NVLink link uses 4 differential signal pairs per direction, so its one-way capacity is 50 x 4 / 8 = 25 GB/s and its bidirectional capacity is 50 GB/s. Across 12 third-generation links, the total bandwidth reaches 600 GB/s, twice that of the V100.

By contrast, the previous-generation V100 had 6 NVLink links, each with 8 differential signal pairs per direction, for a total bandwidth of 300 GB/s.
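Spelling out the arithmetic behind the two totals (taking the V100's per-pair rate as roughly 25 Gb/s, consistent with "almost twice" above):

\[
\text{A100: } \frac{50\ \text{Gb/s} \times 4\ \text{pairs}}{8} = 25\ \tfrac{\text{GB}}{\text{s}}\ \text{per direction} \;\Rightarrow\; 50\ \tfrac{\text{GB}}{\text{s}}\ \text{per link} \times 12\ \text{links} = 600\ \tfrac{\text{GB}}{\text{s}}
\]

\[
\text{V100: } \frac{25\ \text{Gb/s} \times 8\ \text{pairs}}{8} = 25\ \tfrac{\text{GB}}{\text{s}}\ \text{per direction} \;\Rightarrow\; 50\ \tfrac{\text{GB}}{\text{s}}\ \text{per link} \times 6\ \text{links} = 300\ \tfrac{\text{GB}}{\text{s}}
\]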

Each GPU's NVLink links connect it at high speed to other GPUs and to switches; scaling to larger systems requires NVIDIA NVSwitch to tie multiple NVLinks together.

NVIDIA NVSwitch is a node-level switching fabric built on NVLink's communication capabilities; it supports 8 to 16 fully interconnected GPUs in a single server node, letting AI performance scale across multiple GPUs more efficiently.

The new NVSwitch is a 7nm chip with 6 billion transistors and 36 ports, twice as many as the previous-generation NVSwitch, and its total aggregate bandwidth of 9.6 TB/s is likewise double the previous generation's.

Together, NVLink and NVSwitch provide higher bandwidth and more links and improve the scalability of multi-GPU systems; they underpin a whole line of NVIDIA GPU boards, servers, and supercomputing products.

The third-generation NVIDIA NVLink and NVSwitch enable high-speed communication between multiple A100 GPUs in the new NVIDIA DGX, HGX, and EGX systems.

Take the DGX A100: it pairs AMD Rome CPUs with 8 A100 GPUs, 6 NVSwitch chips, and 9 Mellanox ConnectX-6 200Gb/s network interfaces.

With NVIDIA NVLink and NVSwitch inside the node and the latest Mellanox InfiniBand and Ethernet solutions between nodes, A100-based systems can scale to tens, hundreds, or thousands of A100s in compute clusters, cloud instances, or very large supercomputers, meeting the acceleration needs of many kinds of applications and workloads.

Six. Conclusion: Towards the Next Era of Computing

From the previous-generation V100, once billed as "the most powerful AI chip on the planet," to the newly released Ampere-architecture GPU, NVIDIA's AI hardware thinking is clearly tilting toward specialization.

In particular, the reworking of the compute units in the GA100 architecture, including support for new precisions and structured sparsity, is essentially designed around the characteristics of AI and HPC workloads.

As NVIDIA has emphasized in recent years, it has evolved from a pure graphics company into a provider of AI and HPC computing solutions. Whether in the upgrades to its compute and memory architecture or the iteration of its interconnect technology, none of it would be possible without the deep research and engineering capability NVIDIA has accumulated.

The greater compute these advances bring will catalyze research and applications across fields such as AI, 5G, data science, robotics, genomics, and financial analysis.

While many companies are still aiming to surpass the V100's compute, NVIDIA has already raced ahead into the next era of computing.

This article comes from the WeChat public account "core stuff" (ID: aichip001); author: Xinyuan.