In our last article, we discuss the importance of mixed-precision when making deep learning (DL) computer purchases. Here, we will dig deeper into the concept and see how it can reduce the memory requirement at most to half while improving speed 2–4x.
Yet, it will not suffer any accuracy loss.
Let’s get a case study. BERT trains a sequence of vector representations for a sequence of words.
Google positioned BERT as one of its most important updates to the searching algorithms in understanding users’ queries. In addition, BERT is good at representing the latent features of input word sequences that can be purposed for different NLP tasks including question and answer.
Memory Requirement for Google BERT implementation
According to Google, for its GPU implementation, the BERT-Base model requires at least 12GB for the fine-tuning. If the larger BERT-Large model is used, it cannot be trained with 12–16 GB GPU with its current codebase. But this depends on implementation strongly. In later NVidia implementation, with gradient accumulation and AMP, a single 16GB GPU can train BERT-large with the 128-word sequence with an effective batch size of 256 by running batch size 8 and accumulation steps equals 32.
But let’s focus on Google implementation for now. Google performs the GPU training using Titan X GPU with 12GB memory. Here is the batch size used for the corresponding length of the word sequence (seq length). In BERT, because of the attention mechanism, the complexity of the model is proportional to the quadratic of the word sequence length.
For the larger BERT-large model, when the word sequence length goes beyond 320, it cannot even fit a single data sample into the GPU memory for training. In other cases, the batch size becomes too small to achieve the right accuracy.
Fortunately, there are memory optimization techniques including gradient accumulation (generic example and concept) which accumulate gradients for multiple mini-batches before weight updates. So we just need enough GPU memory for a mini-batch. The code to implement gradient accumulation is relatively easy. Another optimization method is the gradient checkpointing which re-computes the activations smartly in the backward pass instead of storing the result in memory. Alternatively, we can break down the model into subcomponents (model parallelism) that can fit into the memory. These approaches trade memory footprint with computation and memory load. In addition, it requires some code change even the gradient accumulation is relatively simple.
Mixed Precision Training
When computers had only megabytes of RAM, the concept of doubling the RAM by compressing memory was once considered. In GPU, we do not double the GPU memory virtually but we can shrink the data size. In specific, instead of using 32-bit floating-point operations, we use 16-bit math. This reduces the memory footprint and the memory load. The half-precision calculation is also faster. Since there are operations remained in 32 bit, the saved memory will be somewhere between 1/3 and 1/2.
Implementing Mixed Precision Training
Below is the general algorithm for mixed-precision.
It involves the casting of weights from FP32 to FP16 before the forward pass.
As shown above, FP16 will only cover a sub-range of FP32. Values outside will be truncated. In specific, when values are too small, they truncate to zero in FP16.
After the casting, both the forward and the backward pass are done in FP16 with faster operation and less memory bandwidth. However, the gradient can be very small in particular after multiplying it with the learning rate. Therefore, a master weight copy in FP32 is available and the weight update is done in FP32. Then it casts back to FP16 when performing the forward and backward pass. This allows the master weight to grow or shrink continuously even the casted weight value may be the same.
However, the gradient in the backward pass can be too small and truncated to 0 in FP16 calculation. The following shows that 50% percentage of gradients can be ignored in the SSD model in object detection.
This is a large percentage. To correct it, we multiply the loss by s (s=8 in this model) before the backward pass and then scale it back in the weight update. This allows a small gradient to continue to be backpropagated.
There are other operations that required 32-bit storage. In specific, all the statistic information, like means and variances, used in the batch normalization will be in 32-bit to allow gradually but small changes.
Automatic Mixed Precision (AMP)
Likely, you don’t want to implement mixed-precision yourself. Fortunately, this can be provided under the hood in many DL platforms using Nvidia AMP (Automatic Mixed Precision). The first one is the code needed for TensorFlow and the second one is for PyTorch using the APEX extension.
More information in using AMP can be found here.
Stay tuned and we will look into the implementation of BERT with mixed precision closer.