Have you ever seen a deep learning model trained on the world's number one supercomputer, across 27,600 V100 GPUs?

Editor’s note: This article is from the WeChat public account “Heart of the Machine” (ID: almosthuman2017). Original authors: Nouamane Laanait, Joshua Romero, et al.; compiled by Heart of the Machine.

Distributed training does run on many GPUs, but have you ever seen a deep learning model trained on the world's number one supercomputer, across 27,600 V100 GPUs? More importantly, with a new communication strategy, that many GPUs achieve a near-linear speedup. This research from Oak Ridge National Laboratory and NVIDIA is truly amazing.

Paper link: https://arxiv.org/pdf/1909.11150.pdf

In this paper, the researchers introduce a new communication strategy for synchronous distributed deep learning, consisting mainly of a gradient reduction orchestration technique and a gradient tensor grouping strategy. These new techniques achieve an almost ideal overlap between computation and communication, delivering near-linear GPU scaling.

In other words, on the Summit supercomputer it can efficiently train a model across 27,600 V100 GPUs, with the scaling efficiency reaching an astonishing 0.93. What does 0.93 mean? We can get a sense of it from TensorFlow's official benchmarks. As shown below, 60 GPUs should ideally deliver a 60x speedup, but earlier TensorFlow training of ResNet-152 achieved only about a 50x speedup.


Image source: https://www.tensorflow.org/guide/performance/benchmarks

That is with only 60 GPUs, and the scaling efficiency is already down to about 0.83; scale out to 1,000 GPUs or more and this figure drops dramatically.
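For reference, scaling efficiency is commonly defined as the measured speedup divided by the ideal linear speedup (i.e., the number of GPUs). That definition is an assumption on our part, but it reproduces both figures quoted above, as this minimal Python sketch shows:

```python
# Scaling efficiency as commonly defined: measured speedup / ideal speedup.
def scaling_efficiency(measured_speedup: float, n_gpus: int) -> float:
    """Fraction of the ideal linear speedup actually achieved."""
    return measured_speedup / n_gpus

# TensorFlow benchmark figures quoted above: ~50x speedup on 60 GPUs.
print(scaling_efficiency(50, 60))   # -> ~0.83

# The Summit result reported in the paper: 0.93 on 27,600 GPUs,
# i.e. an effective speedup of roughly 0.93 * 27,600.
print(0.93 * 27_600)                # -> 25668.0, i.e. ~25,700x
```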

Many distributed training strategies have recently achieved near-linear scaling, such as Ring All-Reduce or DoubleSqueeze (ICML 2019) from Ji Liu and colleagues, but near-linear speedup on 27,600 V100 GPUs of a supercomputer, as in this paper, is still very rare.

In addition, and importantly, the researchers demonstrate the power of large-scale distributed deep learning training on scientific computing problems, which makes better use of supercomputing capacity. This article does not focus on that part; interested readers can refer to the original paper.

What is distributed training?

Intuitively, distributed training simply extends training from one GPU to many, but it raises a variety of questions: how are the model and the data partitioned? How are gradients propagated, and how is the model updated?

In distributed computing, we generally treat different machines as different compute nodes, connected by a network into a computing cluster. The key is then to find a way to "combine" that computing power with model training, and that is what a distributed strategy does. The two most intuitive strategies are model parallelism and data parallelism, which split the training process along different dimensions.

Model parallelism logically splits the model into parts and deploys them to different compute nodes, as was done in AlexNet. This approach mainly addresses models whose parameter counts are too large for a single device. Data parallelism splits the dataset into different subsets and feeds them to different nodes. Unlike models, data is naturally parallelizable, so most practical workloads use data parallelism.

Within data parallelism there are again several strategies, such as synchronous SGD and asynchronous SGD. They are the most common distributed training methods, and frameworks such as TensorFlow and PyTorch expose them directly.


The update processes of synchronous SGD and asynchronous SGD: synchronous SGD waits for all servers to finish their computation, while in asynchronous SGD each server updates the parameters independently.
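For concreteness, a minimal synchronous data-parallel SGD setup with Horovod and PyTorch might look like the sketch below. This is not the paper's training code; the model, data, and hyperparameters are placeholders.

```python
# Minimal synchronous data-parallel SGD sketch with Horovod + PyTorch.
# Illustrative only: the model and data here are stand-ins.
import torch
import horovod.torch as hvd

hvd.init()                                    # one process per GPU
torch.cuda.set_device(hvd.local_rank())       # pin each process to its GPU

model = torch.nn.Linear(1024, 10).cuda()      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across all workers
# with an allreduce before each update (synchronous SGD).
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start all workers from identical parameters.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

for step in range(100):                       # placeholder training loop
    x = torch.randn(32, 1024, device="cuda")  # each rank sees its own shard
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()                           # gradients are allreduced here
    optimizer.step()
```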

What is the problem with data parallelism?

Although data parallelism is the most widely used approach, its shortcomings are obvious. As a distributed strategy, data parallelism requires a very large communication volume and must perform a blocking communication collective to synchronize the DNN gradients during training. Within a single training step, suboptimal overlap between computation and communication operations creates communication overhead, i.e., inefficiency in data-parallel training.
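To make the "blocking collective" concrete, the sketch below uses mpi4py to perform the kind of gradient allreduce that synchronous SGD requires: every rank stalls at the call until all ranks have contributed. The gradient buffer is a stand-in; frameworks such as Horovod issue an equivalent operation (typically via NCCL or MPI) under the hood.

```python
# Sketch of a blocking gradient allreduce with mpi4py (illustrative only).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Pretend these are the flattened gradients computed by this worker.
local_grads = np.random.randn(1_000_000).astype(np.float32)
summed_grads = np.empty_like(local_grads)

# Every rank blocks here until all ranks have contributed their gradients;
# time spent here that is not overlapped with backprop is pure overhead.
comm.Allreduce(local_grads, summed_grads, op=MPI.SUM)

# Average to obtain the synchronous-SGD gradient.
avg_grads = summed_grads / comm.Get_size()
```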

In small to medium-sized systems with 10 to 100 GPU/TPU accelerators, these scaling inefficiencies are hard to detect and systematically optimize because of system noise and load variation. Note, however, that even a 5-10% scaling inefficiency accumulates over a large number of training steps and training runs, further increasing the environmental cost of deep learning.

The inefficiency of data-parallel implementations is most evident on large-scale systems: on accelerator systems with 1,000 to 10,000 chips, many distributed strategies incur heavy losses. In this paper, the researchers argue that supercomputers are ideal systems for developing and testing gradient reduction strategies that give data parallelism near-linear scaling.

Scaling data parallelism to large supercomputer systems is also motivated by the latter's traditional workloads, such as scientific numerical simulation. In particular, applying deep learning to scientific simulation can speed up execution and reduce computational demand, often by using DNNs to approximate the solutions of long-standing inverse problems. In this paper, the researchers take a first step in this direction by improving the gradient reduction strategy.

Measuring data parallelism on a supercomputer

All measurements shown in this article were obtained on Summit, Oak Ridge National Laboratory's supercomputer, currently the number one supercomputer in the world.

The Summit system consists of 256 racks of IBM Power System AC922 compute nodes (roughly 4,600 nodes in total), each equipped with 2 IBM POWER9 CPUs and 6 NVIDIA V100 GPUs, giving approximately 4,600 × 6 = 27,600 V100 GPUs overall.

The researchers examined a data-parallel approach to distributed DNN training. The largest distributed DNN training to date was carried out by Kurth et al. (2018) to learn a segmentation task on climate simulation data. They used a modified DNN segmentation model (DeepLabV3) whose single-GPU computing performance was 38.45 TFLOP_16 (the subscript 16 refers to float16 precision), equivalent to 31% of the V100 GPU's theoretical peak.
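As a quick sanity check on the 31% figure, assuming the commonly quoted ~125 TFLOPS FP16 (tensor-core) theoretical peak of a V100 (an assumption, not a number taken from the paper):

```python
# Check of the quoted 31% utilization, assuming the commonly cited
# ~125 TFLOPS FP16 (tensor-core) theoretical peak of a V100 GPU.
v100_peak_tflops_fp16 = 125.0
measured_tflops_fp16 = 38.45
print(measured_tflops_fp16 / v100_peak_tflops_fp16)  # -> ~0.31
```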

In Figure 1 below, the researchers measured the scalability of hierarchical allreduce on up to 1,024 Summit nodes. The pronounced sublinear scaling stems from inefficient coordination among workers at large node counts, which results in poor overlap between communication and computation.


Figure 1: The effect of different gradient reduction strategies on scaling efficiency.

Core Idea: Gradient Reduction Strategy

To achieve near-linear scaling on such massive compute, a better distributed strategy is needed. The researchers state that their main contribution is a new gradient reduction strategy that achieves optimal overlap between computation and communication, pushing GPU scalability to a new state of the art.

Concretely, the gradient reduction strategy comprises (1) a lightweight coordination technique (Bitvector Allreduce, or "BitAllReduce") and (2) a gradient tensor grouping strategy (Grouping). These two orchestration strategies improve Horovod-based distributed deep learning at different levels.

The effects of BitAllReduce and Grouping on GPU scaling efficiency are shown in Figure 1 by the black and red curves, respectively. Used together, they bring more than an 8x improvement in scaling efficiency (Figures 1 and 2). These gradient reduction strategies are independent of the computing platform and make no assumptions about the interconnect topology of the nodes.


Figure 2: Horovod timelines showing the improvements from orchestration with Bitvector Allreduce and Grouping; the blue vertical lines are cycle markers.

First, Bitvector Allreduce replaces the original coordination of gradient tensor reduction with a collective operation (see Figure 3). The main idea of Bitvector Allreduce is to use cached metadata, associated with each gradient tensor, so that every MPI rank can coordinate the collective operations locally. In essence, Horovod's original coordinator-based strategy is replaced by a single collective (an MPI Allreduce over bitvectors); see Figure 3b.


Figure 3: Comparison of coordination strategies. 3a: In the original coordination strategy, rank 0 (i) collects the requests T_n; (ii) determines the requests common to all ranks; (iii) constructs the associated responses R_n; and (iv) broadcasts the ordered list of responses to all ranks for execution. 3b: In the improved coordination strategy, each rank checks whether a response is in its cache and sets the corresponding bit of the bitvector.
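A minimal sketch of the bitvector idea (not the authors' Horovod implementation): each rank sets a bit for every cached gradient tensor that is locally ready, a single allreduce with bitwise AND combines the bitvectors, and the tensors whose bits survive are ready on every rank and can be reduced immediately. The cache contents and tensor names below are hypothetical.

```python
# Illustrative sketch of bitvector-based coordination (not the authors'
# Horovod code). Each rank marks locally-ready gradient tensors in a
# bitvector; one Allreduce with bitwise AND finds tensors ready everywhere.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Hypothetical cache: tensor name -> bit position, agreed on by all ranks.
cache = {"conv1/grad": 0, "fc1/grad": 1, "fc2/grad": 2}

def ready_bitvector(locally_ready):
    """Encode the set of locally-ready tensors as a 64-bit bitvector."""
    bits = np.zeros(1, dtype=np.uint64)
    for name in locally_ready:
        bits[0] |= np.uint64(1) << np.uint64(cache[name])
    return bits

local_bits = ready_bitvector({"conv1/grad", "fc2/grad"})  # example state
global_bits = np.empty_like(local_bits)

# A single lightweight collective replaces the rank-0 request/response cycle.
comm.Allreduce(local_bits, global_bits, op=MPI.BAND)

for name, bit in cache.items():
    if global_bits[0] & (np.uint64(1) << np.uint64(bit)):
        # This tensor is ready on every rank; its allreduce can be launched.
        print(f"rank {comm.Get_rank()}: reduce {name}")
```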

Second, the researchers introduced a "grouping" scheme, which treats the grouping of gradient tensors as a graph coloring problem. Essentially, each MPI rank colors the nodes of its computational dependency graph, where the nodes correspond to gradient tensors; the gradient tensors are then grouped by color (as shown in Figure 4). Collective operations are issued only for groups whose tensors are ready on all ranks. One advantage of "grouping" is that it gives users the flexibility to organize the collectives in a way that exploits the DNN model architecture for greater efficiency.


Figure 4: Illustration of grouping. The task graph built from the requests T_n is shown on the left, where different tasks are different nodes. The dashed boxes show the requests that Horovod sees over three cycles. The nodes can be divided into two groups, indicated by different colors: blue solid nodes and green dashed nodes.
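A rough sketch of the grouping idea under a similar illustrative setup: each tensor is assigned a color (group), and one fused collective is issued only once every tensor in that group is ready on all ranks. The colors are hard-coded here; in the paper they come from coloring the computational dependency graph.

```python
# Illustrative sketch of the grouping idea (colors hard-coded here; the
# paper derives them by coloring the computational dependency graph).
from collections import defaultdict

# Hypothetical color assignment: gradient tensor -> group id.
color = {"conv1/grad": 0, "conv2/grad": 0, "fc1/grad": 1, "fc2/grad": 1}

groups = defaultdict(set)
for tensor, c in color.items():
    groups[c].add(tensor)

ready_everywhere = set()        # filled in as bitvector allreduces complete

def on_tensor_ready(tensor, launch_allreduce):
    """Launch one fused collective per group, only when the group is complete."""
    ready_everywhere.add(tensor)
    c = color[tensor]
    if groups[c] <= ready_everywhere:           # the whole group is ready
        launch_allreduce(sorted(groups[c]))     # fuse the group's tensors
        ready_everywhere.difference_update(groups[c])

# Example usage with a stand-in for the real fused allreduce.
on_tensor_ready("fc2/grad", launch_allreduce=print)   # group incomplete, no-op
on_tensor_ready("fc1/grad", launch_allreduce=print)   # -> ['fc1/grad', 'fc2/grad']
```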

Finally, the researchers found that Grouping and Bitvector Allreduce can each be used on its own, but combining them yields even greater performance gains. The above is only a brief introduction to the gradient reduction strategy; more implementation details can be found in Section 4 of the original paper.

Experimental results

An important metric for the efficiency of applications running on supercomputers is measured power consumption. In particular, using a blocking collective such as allreduce stalls all operations executing on the GPU/CPU until the result of the collective is returned.

Figure 5 below shows the power consumption of the main hardware components of Summit while the authors ran distributed training with Bitvector Allreduce and Grouping.


Figure 5: Power analysis of Summit during distributed training on 4,600 nodes. Power consumption is collected from Summit's main hardware components (GPU, CPU, etc.) over a distributed training run.

Beyond power consumption, the authors also profile the computational performance of distributed training with the new gradient reduction strategy. The performance measurements cover: (1) I/O (data reading and model checkpoint writing), (2) the computation performed in the DNN forward and backward passes, and (3) the communication operations embedded in the computation.


In Table 1, the authors use the performance evaluation method described above to summarize the application's mathematical operations, timing, and overall performance for a single training step on a single Summit node.

Finally, using the communication strategy described in Section 2.3, the researchers achieved a scaling efficiency of 0.93 on 4,600 nodes during distributed deep learning (Figure 6), reaching a sustained performance of 1.54(2) EFLOPS_16 (2.15(2) EFLOPS_16 peak).


Figure 6: Scaling efficiency and sustained performance of distributed deep learning when scaling the gradient reduction strategy to 27,600 V100 GPUs.