Improve Deep Learning Model Performance & Network Tuning (Part 6)
Once our models are debugged, we can focus on model capacity and tuning. In this section, we discuss how to improve the performance of a deep learning network and how to tune its hyperparameters.
The 6-part series for “How to start a Deep Learning project?” consists of:
· Part 1: Start a Deep Learning project.
· Part 2: Build a Deep Learning dataset.
· Part 3: Deep Learning designs.
· Part 4: Visualize Deep Network models and metrics.
· Part 5: Debug a Deep Learning Network.
· Part 6: Improve Deep Learning Models performance & network tuning.
Increase model capacity
To increase capacity, we add layers and nodes to a deep network (DN) gradually. Deeper layers produce more complex models. We also reduce filter sizes: smaller filters (3x3 or 5x5) usually perform better than larger ones.
The tuning process is more empirical than theoretical. We add layers and nodes gradually, with the intention of overfitting the model, since we can tone it down later with regularization. We repeat the iterations until the accuracy improvement diminishes and no longer justifies the extra training and computation cost.
However, GPUs do not page out memory. As of early 2018, the high-end NVIDIA GeForce GTX 1080 Ti has 11GB of memory. The maximum number of hidden nodes between two affine layers is restricted by the memory size.
For very deep networks, the vanishing gradient problem is serious. We add skip connections (like the residual connections in ResNet) to mitigate it.
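As a rough sketch (not the author's code), a residual block adds the input back to the output of a small transformation, so gradients can flow through the identity path. The shapes and the `residual_block` helper below are illustrative assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """A minimal residual (skip-connection) block: y = relu(x + F(x)).

    The identity path lets gradients reach earlier layers directly,
    mitigating the vanishing gradient problem in very deep networks.
    """
    h = relu(x @ w1)      # first transformation
    f = h @ w2            # second transformation (no activation yet)
    return relu(x + f)    # the skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))          # a batch of 4 samples, 8 features
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = rng.standard_normal((8, 8)) * 0.1
y = residual_block(x, w1, w2)            # same shape as the input
```

Because the block computes `x + F(x)`, the layer only has to learn a correction to the identity, which is what makes very deep stacks trainable.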
Model & dataset design changes
Here is the checklist to improve performance:
- Analyze errors (bad predictions) in the validation dataset.
- Monitor the activations. Consider batch or layer normalization if they are not zero-centered or normally distributed.
- Monitor the percentage of dead nodes.
- Apply gradient clipping (in particular for NLP) to control exploding gradients.
- Shuffle the dataset (manually or programmatically).
- Balance the dataset (each class has a similar number of samples).
We should closely monitor the activation histograms before the activation functions. If the activations are on very different scales, gradient descent will be ineffective: apply normalization. If the DN has a huge number of dead nodes, we should trace the problem further. It can be caused by bugs, weight initialization or vanishing gradients. If none of these is the cause, experiment with advanced ReLU variants like leaky ReLU.
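One way to track dead nodes is to count the ReLU units that stay at zero for every sample in a batch. This is a minimal sketch, and `dead_node_fraction` is a hypothetical helper, not part of any framework:

```python
import numpy as np

def dead_node_fraction(activations, eps=1e-8):
    """Fraction of ReLU units that are (near) zero for every sample
    in the batch -- a proxy for 'dead' nodes.

    activations: array of shape (batch_size, num_units), post-ReLU.
    """
    dead = (np.abs(activations) < eps).all(axis=0)  # zero across the whole batch
    return dead.mean()

# Toy batch of 2 samples x 3 units: the first unit never fires.
batch = np.array([[0.0, 1.2, 0.0],
                  [0.0, 0.0, 0.3]])
frac = dead_node_fraction(batch)   # 1 of 3 units is dead
```

In practice you would accumulate this statistic over many batches, since a unit that is zero on one batch may fire on another.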
Dataset collection & cleanup
If you build your own dataset, the best advice is to research hard how to collect samples. Find the highest-quality sources, and filter out everything irrelevant to your problem. Analyze the errors. In our project, images with highly entangled structures perform badly. We could change the model by adding convolution layers with smaller filters, but the model is already hard to train. We could add more entangled samples for further training, but we have plenty already. Alternatively, we can refine the project scope and narrow down our samples.
Collecting labeled data is expensive. For images, we can apply data augmentation with simple techniques like rotation, random cropping, shifting, shearing and flipping to create more samples from existing data. Other color distortions include hue, saturation and exposure shifts.
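A couple of these techniques can be sketched in a few lines of numpy. The `augment` function below is a toy illustration (real pipelines use library transforms with interpolation and padding):

```python
import numpy as np

def augment(image, rng):
    """Randomly flip and shift an image (H x W array) -- a tiny sample
    of the augmentation techniques mentioned above."""
    if rng.random() < 0.5:
        image = np.fliplr(image)       # horizontal flip
    shift = rng.integers(-2, 3)        # small horizontal shift in [-2, 2]
    image = np.roll(image, shift, axis=1)
    return image

rng = np.random.default_rng(42)
img = np.arange(25, dtype=float).reshape(5, 5)
aug = augment(img, rng)                # same shape, same pixels rearranged
```

Each call produces a slightly different view of the same sample, which is exactly how augmentation multiplies a small labeled set.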
We can also supplement training data with unlabeled data. Use your model to classify the data; for samples with high-confidence predictions, add them to the training dataset with the corresponding predicted labels.
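This pseudo-labeling idea can be sketched as a filter over the model's predictions. `model_predict` and the confidence threshold are assumptions for illustration:

```python
def pseudo_label(model_predict, unlabeled, threshold=0.95):
    """Add high-confidence predictions on unlabeled data to the training set.

    model_predict: maps a sample to a (label, confidence) pair.
    Returns (sample, pseudo_label) pairs that pass the threshold.
    """
    new_samples = []
    for x in unlabeled:
        label, conf = model_predict(x)
        if conf >= threshold:          # keep only confident predictions
            new_samples.append((x, label))
    return new_samples

# Toy 'model': confident only on even numbers.
fake_predict = lambda x: (x % 2, 0.99 if x % 2 == 0 else 0.6)
extra = pseudo_label(fake_predict, [1, 2, 3, 4])   # keeps 2 and 4
```

The threshold trades quantity for quality: a lower value adds more samples but also more wrong labels, which can reinforce the model's mistakes.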
Learning rate tuning
Let’s have a short recap on tuning the learning rate. In early development, we turn off or set to zero any non-critical hyperparameters, including regularization. With the Adam optimizer, the default learning rate usually works well. If we are confident in the code but the loss still does not drop, start tuning the learning rate. The typical learning rate ranges from 1 to 1e-7. Drop the rate by a factor of 10 each time, test it in short iterations, and monitor the loss closely. If the loss goes up consistently, the learning rate is too high. If it does not go down, or flattens prematurely, the learning rate is too low: increase it.
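The sweep above can be written as a small loop: run a short training at each candidate rate and compare the resulting losses. `train_short` is a stand-in for your own short training run, not a real API:

```python
def lr_sweep(train_short, lrs=(1e0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7)):
    """Try each learning rate (dropping by 10x) in a short run and
    return (lr, final_loss) pairs sorted by loss, best first."""
    results = [(lr, train_short(lr)) for lr in lrs]
    return sorted(results, key=lambda p: p[1])

# Toy stand-in for a short training run: loss is minimized near lr = 1e-3.
toy_train = lambda lr: (lr - 1e-3) ** 2
best_lr, best_loss = lr_sweep(toy_train)[0]
```

Once the sweep narrows the range, you can repeat it with a finer factor (e.g. 3x) around the best value.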
The following is a real example in which the learning rate is too high and causes a sudden surge in cost with the Adam optimizer:
A less common practice is to monitor the ratio of weight updates to weights:
- If the ratio is > 1e-3, consider lowering the learning rate.
- If the ratio is < 1e-3, consider increasing the learning rate.
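The ratio is simply the magnitude of one update divided by the magnitude of the weights. A minimal sketch of the computation, with illustrative toy values:

```python
import numpy as np

def update_to_weight_ratio(w, w_update):
    """Ratio of update magnitude to weight magnitude.
    A common heuristic target is around 1e-3 per step."""
    return np.linalg.norm(w_update) / np.linalg.norm(w)

# Toy example: ||w|| = 10, ||update|| = 0.01, so the ratio is 1e-3.
w = np.ones(100)
w_update = np.full(100, 1e-3)
ratio = update_to_weight_ratio(w, w_update)
```

In training code, `w_update` is the actual step applied by the optimizer (learning rate times the gradient term), measured per layer.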
Once the model design is stable, we can tune the model further. The most frequently tuned hyperparameters are:
- Mini-batch size
- Learning rate
- Regularization factors
- Layer-specific hyperparameters (like dropout)
Typical batch sizes are 8, 16, 32 or 64. If the batch size is too small, the gradient descent will not be smooth: the model learns slowly and the loss may oscillate. If the batch size is too large, the time to complete one training iteration (one round of updates) becomes long with relatively small returns. For our project, we lowered the batch size because each training iteration took too long. We monitor the overall learning speed and the loss closely. If the loss oscillates too much, we know we have gone too far. Batch size impacts other hyperparameters like the regularization factors. Once we determine the batch size, we usually lock in the value.
Learning rate & regularization factors
We can tune the learning rate and regularization factors further with the approach mentioned before. We monitor the loss to control the learning rate, and the gap between the validation and training accuracy to adjust the regularization factors. Instead of changing the value by a factor of 10, we change it by a factor of 3 (or even less during fine-tuning).
Tuning is not a linear process. Hyperparameters are related, and we come back and forth when tuning them. The learning rate and regularization factors are highly related and sometimes need to be tuned together. Do not waste time on fine-tuning too early: design changes easily void such efforts.
The dropout rate typically ranges from 20% to 50%. We can start with 20% and increase the value if the model overfits.
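For reference, the standard "inverted dropout" trick zeroes out a fraction of activations during training and rescales the rest, so no change is needed at inference time. A minimal numpy sketch (frameworks provide this as a built-in layer):

```python
import numpy as np

def inverted_dropout(x, rate, rng, train=True):
    """Inverted dropout: drop `rate` of the activations during training
    and rescale the survivors by 1/(1-rate), so the expected value of
    each activation is unchanged and inference needs no adjustment."""
    if not train:
        return x
    keep = 1.0 - rate
    mask = (rng.random(x.shape) < keep) / keep   # 0 or 1/keep per unit
    return x * mask

rng = np.random.default_rng(0)
x = np.ones((1000,))
y = inverted_dropout(x, rate=0.2, rng=rng)   # ~20% zeros, rest scaled to 1.25
```

Because of the rescaling, the mean activation stays close to the original even though a fifth of the units are silenced.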
Sparsity & activation functions
Sparsity in model parameters makes computation optimization easier and reduces power consumption, which is important for mobile devices. If needed, we may replace the L2 regularization with L1 regularization. ReLU is the most popular activation function. In some deep learning competitions, people experiment with more advanced ReLU variants to push the bar slightly higher. They also reduce dead nodes in some scenarios.
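Leaky ReLU, one of those variants, simply replaces the flat zero region with a small slope so negative inputs still pass a gradient. A one-line sketch:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: a small slope `alpha` for negative inputs keeps a
    gradient flowing, reducing the chance of dead nodes."""
    return np.where(x > 0, x, alpha * x)

out = leaky_relu(np.array([-2.0, 0.0, 3.0]))   # negative input shrinks to -0.02
```

With plain ReLU the -2.0 input would output 0 and contribute no gradient; the small `alpha` slope is what lets a "dead" unit recover.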
There are more advanced fine-tuning techniques:
- Learning rate decay schedule
- Early stopping
Instead of a fixed learning rate, we can decay the learning rate regularly. The hyperparameters are how often and by how much it drops. For example, you can apply a 0.95 decay rate every 100,000 iterations. To tune these parameters, we monitor the cost to verify that it drops faster but does not flatten prematurely. Some trainings use a specific schedule, for example reducing the learning rate by 10x after 1M iterations and by another 10x after 1.2M iterations.
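The 0.95-every-100,000-iterations schedule above is a staircase exponential decay, which is one line of arithmetic. A sketch (frameworks ship this as a built-in schedule):

```python
def decayed_lr(base_lr, step, decay_rate=0.95, decay_steps=100_000):
    """Staircase exponential decay: multiply the rate by `decay_rate`
    once every `decay_steps` iterations."""
    return base_lr * decay_rate ** (step // decay_steps)

lr0 = decayed_lr(0.001, step=0)          # still the base rate
lr1 = decayed_lr(0.001, step=100_000)    # one decay step applied
```

Dropping `decay_steps` or `decay_rate` makes the schedule more aggressive; tune them by watching whether the cost keeps falling or flattens too early.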
Advanced optimizers use momentum to smooth out the gradient descent. The Adam optimizer has two momentum settings controlling the first-order (default 0.9) and second-order (default 0.999) momentum. For problem domains with steep gradients like NLP, we may increase the values slightly.
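To see where those two settings sit, here is a minimal single-parameter Adam update following the published algorithm (a sketch for intuition, not a production optimizer):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. beta1 and beta2 are the first- and second-order
    momentum settings discussed above; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0])
m, v = np.zeros(1), np.zeros(1)
w, m, v = adam_step(w, grad=np.array([0.5]), m=m, v=v, t=1)
```

Raising `beta1` or `beta2` makes the moving averages remember more history, which smooths the updates at the cost of reacting more slowly.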
Overfitting can be reduced by stopping the training when the validation errors increase persistently.
In practice, the validation error may go up temporarily and then drop again, so a single rise is not a reliable stopping signal. We can checkpoint models regularly and log the corresponding validation errors. Later we select the model with the lowest validation error.
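Early stopping with checkpointing can be sketched as tracking the best epoch and stopping after the error fails to improve for a few epochs ("patience"). The helper below is an illustration with a toy error curve:

```python
def train_with_checkpoints(val_errors, patience=3):
    """Checkpoint every epoch; stop once the validation error has not
    improved for `patience` consecutive epochs. Returns the epoch and
    error of the best checkpoint.

    val_errors: per-epoch validation errors (stand-in for real training).
    """
    best_epoch, best_err, bad = 0, float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_epoch, best_err, bad = epoch, err, 0   # new best checkpoint
        else:
            bad += 1
            if bad >= patience:                         # stop training
                break
    return best_epoch, best_err

# The error dips, rises temporarily, dips again, then rises for good.
errs = [0.9, 0.5, 0.6, 0.4, 0.45, 0.5, 0.55]
best_epoch, best_err = train_with_checkpoints(errs)
```

Note how the temporary rise at epoch 2 does not end training; only the sustained rise at the end does, and we still return the checkpoint from epoch 3.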
Grid search for hyperparameters
Some hyperparameters are strongly related. We should tune them together over a mesh of possible combinations on a logarithmic scale. For example, for two hyperparameters λ and γ, we start from the corresponding initial values and drop them by a factor of 10 at each step:
- (1e-1, 1e-2, … and 1e-8) and,
- (1e-3, 1e-4, … and 1e-6).
The corresponding mesh is [(1e-1, 1e-3), (1e-1, 1e-4), …, (1e-8, 1e-5) and (1e-8, 1e-6)].
Instead of using the exact cross points, we shift those points slightly at random. This randomness may lead us to surprises that would otherwise stay hidden. If the optimal point lies on the border of the mesh, we retest further in the border region.
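Building such a jittered log-scale mesh takes only a few lines. `jittered_log_grid` is an illustrative helper, with the jitter amount expressed in decades:

```python
import random

def jittered_log_grid(exps_a, exps_b, jitter=0.3, seed=0):
    """Build a log-scale mesh of (lambda, gamma) candidates and shift
    each point slightly, so we do not always test the exact cross points.

    exps_a, exps_b: base-10 exponents for the two hyperparameters.
    jitter: maximum random shift, in decades, applied to each exponent.
    """
    rng = random.Random(seed)
    mesh = []
    for ea in exps_a:
        for eb in exps_b:
            a = 10.0 ** (ea + rng.uniform(-jitter, jitter))
            b = 10.0 ** (eb + rng.uniform(-jitter, jitter))
            mesh.append((a, b))
    return mesh

# Exponents -1..-8 for lambda crossed with -3..-6 for gamma: 8 x 4 points.
mesh = jittered_log_grid(range(-1, -9, -1), range(-3, -7, -1))
```

Each run would then train one short model per mesh point and record the validation score, keeping the best region for a finer follow-up search.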
A grid search is computationally intensive. For smaller projects, it is used sporadically. We start tuning parameters at coarse grain with fewer iterations. To fine-tune the result, we use longer iterations and drop values by a factor of 3 (or even less).
In machine learning, we can take votes from a number of decision trees to make predictions. This works because mistakes are often localized: there is a smaller chance of two models making the same mistake. In DL, each training run starts with random guesses (provided the random seeds are not explicitly set), so the optimized models are not unique. We can pick the best models after many runs using the validation dataset, then take votes from those models to make the final predictions. This method requires running multiple training sessions and can be prohibitively expensive. Alternatively, we run the training once and checkpoint multiple models, picking the best models from the checkpoints. With ensemble models, the predictions can be based on:
- one vote per model,
- weighted votes based on the confidence level of each model’s prediction.
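Both voting schemes above reduce to accumulating (possibly weighted) votes per class. A minimal sketch, with the model outputs as hypothetical toy labels:

```python
from collections import Counter

def ensemble_predict(predictions, weights=None):
    """Combine per-model predictions by (optionally weighted) voting.

    predictions: list of class labels, one per model.
    weights: optional confidence weight per model; one vote each if None.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    votes = Counter()
    for label, w in zip(predictions, weights):
        votes[label] += w
    return votes.most_common(1)[0][0]

majority = ensemble_predict(["cat", "dog", "cat"])              # one vote per model
weighted = ensemble_predict(["cat", "dog"], weights=[0.6, 0.9]) # confidence-weighted
```

With equal votes the majority class wins; with confidence weights a single very confident model can overrule the others.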
Model ensembles are very effective at pushing the accuracy up a few percentage points in some problems and are very common in some DL competitions.
Instead of fine-tuning a model, we can try different model variants to leapfrog the model performance. For example, we considered replacing the color generator partially or completely with an LSTM-based design. This concept is not completely foreign: we draw pictures in steps.
Intuitively, there are merits in introducing a time sequence method in image generation. This method has proven some success in DRAW: A Recurrent Neural Network For Image Generation.
Fine-tuning vs. model improvement
Major breakthroughs require model design changes. However, some studies indicate that fine-tuning a model can be more beneficial than making incremental model changes. The final verdict is likely based on your own benchmarking results.
You may have a simple question like: should I use leaky ReLU? It sounds simple, but you will never get a straight answer anywhere. Some research papers show empirical data that leaky ReLU is superior, yet some projects see no improvement. There are too many variables, and many projects do not have the resources to benchmark even a portion of the possibilities. Kaggle is an online platform for data science competitions, including deep learning. Dig through some of the competitions and you can find the most common performance metrics. Some teams also publish their code (called kernels). With some patience, it is a great source of information.
DL requires many experiments, and tuning hyperparameters is tedious. Creating an experiment framework can expedite the process. For example, some people develop code to externalize model definitions into a string for easy modification. Those efforts are usually counterproductive for a small team. I personally find the drop in code simplicity and traceability far worse than the benefit: such coding makes simple modifications harder than they should be. Easy-to-read code has fewer bugs and is more flexible. Instead, many AI cloud offerings now provide automatic hyperparameter tuning. It is still in its infancy, but the general trend should be that we do not code the framework ourselves. Stay tuned for developments!
Now you have your model tuned and ready to deploy. If you have more tuning tips, feel free to share them in the comments. I hope you found this 6-part series useful.
There are many problems that deep learning can solve: many more than you may imagine. Can a designer hand you a visual mockup and have you generate the HTML automatically with deep learning? Impossible? Do a Google search on pix2code or sketch2code!