Buy a Deep Learning Computer — David vs. Goliath


MVP in the tech world is not "Most Valuable Player". A Minimum Viable Product (MVP) means testing hypotheses, finding the optimum product-market fit, and doing what matters. In other words, don't over-plan; learn from your experience and scale the solution.

I have delayed my personal Deep Learning computer upgrade for the last 12 months, waiting for the right time to build a new 2020 MVP. In this article, we will discuss the key decision points and costs for acquiring a Deep Learning (DL) computer. We will also look into alternatives like cloud solutions. For people who just want a quick answer, it is at the end of the article.

When you build a new million-dollar home, you may plan for an $850K house and keep the rest as a contingency fund. A UCLA management class once taught me the other way around: design a $1.2M house and try hard to get the same design built for $1M. So let's look at a state-of-the-art machine first.

[Image: Nvidia DGX SuperPod]

Nvidia has a cool machine called the DGX SuperPod. It has 96 DGX-2H systems with a total of 192 CPUs, 1,536 GPUs, and 1.4 TB of GPU memory. This machine trains a very important NLP model (BERT-large) in 47 minutes; a single-GPU system would otherwise take months.


The system contains 96 DGX-2 pods; each pod holds 16 Nvidia Tesla V100 GPUs and costs about $400K. Can this be my million-dollar design? This will be my David vs. Goliath story.

What are other choices?

If you have never taken a DL course, don't buy a new machine just to do DL coding. Your first DL code should run on Google Colaboratory or a gaming machine you already have. For your final project or poster presentation, you will likely need something more powerful. If you don't have an Nvidia video card, a cloud solution is a good alternative. The first possibility is Amazon AWS. The diagram below shows the price for an entry-level machine with GPUs. Training a model with a K80 GPU for a couple of days will cost about $43.

[Image: Amazon AWS GPU instance pricing. Source: Amazon AWS]

Fortunately, Amazon offers spot instances at about 1/3 of the price above.

[Image: AWS spot instance pricing]

The catch is that Amazon can kill your instance (virtual machine) at any time during high demand. With EBS storage and a simple DL code change, you can manually restore and resume the training on a new instance. Fortunately, this does not happen very often. The instance price goes up quickly with more GPUs, but for complex models that require many GPUs to train, this is an excellent solution.
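A minimal sketch of the kind of checkpointing that makes spot interruptions tolerable, assuming PyTorch and a hypothetical EBS mount path:

    # Checkpoint to EBS-backed storage so training can resume on a replacement spot instance.
    import os
    import torch

    CKPT_PATH = "/mnt/ebs/checkpoint.pt"       # hypothetical EBS mount point

    def save_checkpoint(model, optimizer, epoch):
        torch.save({"epoch": epoch,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()}, CKPT_PATH)

    def load_checkpoint(model, optimizer):
        if not os.path.exists(CKPT_PATH):
            return 0                           # no checkpoint yet: start from epoch 0
        state = torch.load(CKPT_PATH)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["epoch"] + 1              # resume from the next epoch

In the training loop, call load_checkpoint once at startup to get the starting epoch and save_checkpoint at the end of every epoch; when a spot instance dies, the replacement picks up from the last saved epoch.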

A new cloud alternative is Google Cloud TPU.


I once looked at the performance benchmarks for the Google TPU (Tensor Processing Unit). When I saw the unusual performance standout on the Transformer model, it was clear that the TPU specializes in matrix operations. Mathematically, a tensor is a scalar, a vector, or a k-dimensional array, which is where the name comes from: the TPU is a processor specialized in the matrix operations of linear algebra.

Both cloud offerings make a big difference for training models that take weeks or months. We can start development on a standalone computer with GPU(s), then switch to cloud computing for more extensive training if needed.

Warning

All the pricing, performance, and model information here is just a snapshot. The purpose of this article is to give you enough information to guide a future purchase.

What system should I buy?

Personally, I use a standalone system for development and cloud solutions for training very complex models. But the cost can run up fast, so let's focus on buying a DL computer for now. No good advice can be given without proper context, though.

So I will try a different approach to answering this question. The previous section covered beginners, who have free options, and very demanding domains, which call for cloud computing. For everyone else, I will roughly focus on five price ranges for the whole computer and discuss what you can do with each.

Sub $1000

If you have a GPU video card with 4 to 6GB of memory, you can manage to complete DL course assignments. I put this in the sub-$1000 PC range. You can use many popular models, like ResNet 50, to make predictions. However, you cannot retrain or refine them without significant simplification or loss in validation accuracy.
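If you are not sure how much memory your existing card has, a quick check will tell you (a minimal PyTorch sketch; any framework has an equivalent):

    # Print the name and memory of the first CUDA GPU, if any.
    import torch

    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(props.name, round(props.total_memory / 1e9, 1), "GB")
    else:
        print("No CUDA-capable GPU detected")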

$1700-$2400

Next, you get serious. You want to finish a project that can wow people at a poster presentation. The free options may take too long to train your models, so you want to shift most or all experiments to a local machine. That is the $1,700 gaming-machine range you may look into. At this level, you can get a nice CPU, 32 GB RAM, a GPU with 8GB of memory, and 2–3 TB of storage. Another $700 gets you the Nvidia 2080TI with 11GB of GPU memory. Performance jumps close to 100% when you move from a 2070 to a 2080TI.

Let's do some math to see what you can do with an industrial-strength DL model on this higher-end 2080TI. ResNet 50 is quite popular in computer vision. It has a total of 23,587,712 parameters, and for each sample, about 16 million activations must be recorded in the forward pass to perform backpropagation. In the ResNet paper, the experiment uses a batch size of 256 with a total of 600,000 iterations, i.e. 153.6M images processed. A 2080TI GPU can execute 294 images per second on ResNet 50 (see credits for details), so the whole training can be done in about 6 days. However, to fit all the model parameters and the activation results of a batch into an 11GB GPU, we may need to reduce the batch size to 32. Fortunately, mixed-precision optimization speeds up the training (about 3.3x) and cuts down the memory requirement, but we will delay that discussion until later. For now, you should have a feel for what your money can buy.
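Here is the back-of-envelope arithmetic behind those numbers, so you can redo it with the throughput of your own GPU:

    # Back-of-envelope ResNet-50 training time on a single 2080TI.
    batch_size = 256
    iterations = 600_000
    images_per_second = 294                      # 2080TI throughput on ResNet-50 (see credits)

    total_images = batch_size * iterations       # 153,600,000 images
    days = total_images / images_per_second / 86_400
    print(f"{total_images:,} images -> about {days:.1f} days")   # roughly 6 days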

$5,400 with dual GPUs

Now you are very serious about DL. You develop and train new models, which requires many experiments; it may take hundreds of them before you find the right one. You want fast iterations and can't wait hours or days before knowing your next move. You can plug another GPU card into the machine, which will cost you $500 to $1K more. Unfortunately, with some higher-end GPUs, this may not work. Major PC vendors may require you to move to the next tier of machine first, with a bigger power supply, a higher-end CPU, and wider PCIe lanes. For example, a dual-2080TI computer with 64GB of CPU memory from a PC vendor costs about $5.4K, up from $2.5K for a single-2080TI machine. Here you start witnessing the non-linear price increase with system capability. Fortunately, for a dual-2080 (non-TI) machine, you may still be able to get it for around $2.5K.

Do you double the performance when adding a second card? The answer depends on the DL model, but the good news is that most models scale pretty well in this range. Many DL software platforms have been optimized to scale reasonably linearly for popular models in this zone. The one to watch out for is the Transformer model in NLP: it is basically composed of fully-connected layers, and it takes more optimization techniques to increase data parallelism.

$8K+ with multiple GPUs

You have an amazing startup idea, but it requires long training with large DL models. Now you are in a zone where most big PC vendors do not have a product for you; you will get it from specialized stores. A 4x 2080TI machine starts around $8K. Yes, most DL models still scale reasonably well in this zone, and you can get close to a 4x improvement.

Above and beyond

The next level is targeted at corporations and data centers. It involves high-end GPUs in a cluster, and these machines are expensive. The machine in AWS belongs to this class. One key advantage is the amount of memory per GPU, which is likely 24 GB+, and you can train a model with many GPUs. Most of us can access these machines through a cloud service and rent them by the hour. Soon, we may even rent an 800 NVIDIA V100 Tensor Core GPU supercomputer from Microsoft Azure.

How much to spend?

You may still ask what price range you should spend in. Intermediate to advanced users fall into two groups: application developers who don't do complex model training, and model researchers. The former group may be happy to take pre-trained DL models and focus on creating solutions from predictions only, with no complex training involved. Since many DL software platforms come with a zoo of well-known models, these engineers are well covered and can get away with GPUs with less memory. Otherwise, to train a model with a batch size of K, it usually takes approximately K times the GPU resources.
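For that prediction-only workflow, the code is often just a few lines against the model zoo. A minimal sketch with torchvision (the image file name is a placeholder):

    # Prediction-only use of a pre-trained ResNet-50 from the torchvision model zoo.
    import torch
    from torchvision import models, transforms
    from PIL import Image

    model = models.resnet50(pretrained=True).eval()   # newer torchvision uses the weights= argument

    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    img = preprocess(Image.open("example.jpg")).unsqueeze(0)   # placeholder image path
    with torch.no_grad():
        probs = torch.softmax(model(img), dim=1)
    print(probs.topk(5))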

We usually don't retrain ResNet or VGG16. But if you are solving a problem domain with very different training data, you need to retrain the model. In other situations, you develop custom models or optimize pre-trained models with transfer learning. Some people laughed when ResNet was introduced with 152 layers; now, models are getting very sophisticated, with deep layers in a high-dimensional latent space. In short, the GPU performance and memory requirements for model researchers are far higher than for application development.

Time the turnaround of your daily experiments. If a major one cannot be finished overnight, you may need to up your game a little bit more: it hurts how you iterate your solutions and reduces your chance of success significantly. Again, there are too many moving parameters here to make a general suggestion. For example, professionals may value their time more than students and be willing to pay more for less performance return.

Next, we will look into each PC component closely and discuss the available choices.

Choosing Components

GPU

The two major decisions for a GPU are performance and memory. The performance/cost ratio and overheating are two other issues that need attention.

The GPU-to-CPU ratio in the DGX-2H system is 8:1. Skip connections allow us to build models of far greater depth, so we expect our intended model training to be GPU-bound, and our first priority is the choice of GPU.

Which DL processor do we pick? Nvidia GPUs have the best DL support. You can access Google TPUs from a cloud service, but you cannot buy a TPU. For the last couple of years, there has been nice traction from AMD in providing DL software support. The strategy is to use the ROCm platform to provide a device- and software-platform-agnostic environment. One key idea is to use HIP (Heterogeneous-Compute Interface for Portability) to ease the conversion of CUDA applications to portable C++ code. But converting the whole TensorFlow or PyTorch codebase to HIP may not be as simple as it looks. As the devil is in the details, the overall strategy may need to be doubled down on before it becomes a compelling alternative. That is very bad news for Mac users like me: you may need to build a separate Linux box.

Many DL models take a long time to train. So extra performance capacity in GPU will not be wasted.

GPU memory is the other major concern. ResNet 50 has 24M model parameters. To perform the backward gradient propagation, it needs memory to store all the activations in each layer for each training sample. Transformer models, which have become popular in NLP, take a lot of memory. Memory is a thorny issue in NLP problems since their cost functions tend to have large curvature and need a larger batch size. This challenge was summarized by Google in its open-source BERT implementation in 2018:

All results on the paper were fine-tuned on a single Cloud TPU, which has 64GB of RAM. It is currently not possible to re-produce most of the BERT-Large results on the paper using a GPU with 12GB - 16GB of RAM, because the maximum batch size that can fit in memory is too small. We are working on adding code to this repository which allows for much larger effective batch size on the GPU.

But the memory requirement depends strongly on the implementation. Later, Nvidia produced an implementation that trains the model with 16GB. Nevertheless, the trend of increasing GPU memory requirements is likely to continue. The human brain has 86 billion neurons with as many as 1,000 trillion synaptic connections; the current GPU memory may be just like the 128K RAM in the first Apple Macintosh.
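To get a feel for where the memory goes, here is a crude estimate for FP32 ResNet-50 training, using the parameter and activation counts quoted earlier. It ignores optimizer state, cuDNN workspaces, and framework overhead, so treat it as a lower bound; it still shows why a batch of 256 cannot fit in 11GB while a batch of 32 can:

    # Crude FP32 memory estimate: weights + gradients + stored activations.
    params = 23_587_712            # ResNet-50 parameters
    acts_per_sample = 16_000_000   # approx. activations kept per image for backprop
    bytes_fp32 = 4

    weights_and_grads = 2 * params * bytes_fp32
    for batch_size in (256, 32):
        activations = acts_per_sample * batch_size * bytes_fp32
        total_gb = (weights_and_grads + activations) / 1e9
        print(f"batch {batch_size}: ~{total_gb:.1f} GB before framework overhead")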

A high-end GPU increases capacity in both performance and memory, but it comes with a price. The price roughly doubles from an 8GB GPU ($500–$700) to an 11GB GPU ($1,100), and doubles again to 24GB. The corresponding performance improvements are roughly 100% from a 2070 to a 2080TI, 50% from a 2080 to a 2080TI, and 20% from a 2080TI to a Titan RTX. If you don't yet understand the DL concepts discussed here, an 8GB GPU (2070 Super or 2080 Super) is a good start if you want to invest your time in DL. Many people like the 11GB GPU (2080TI) for its speed. The key question for the 24GB card is whether your model can be trained with only 11GB of memory; again, if you are unsure, stay with the MVP principle. Nevertheless, these performance figures depend strongly on the corresponding tests and code, so treat them as a reference: your mileage may vary significantly.

And this information changes quickly, along with pricing. There are many good GPU performance benchmarks and price/performance analyses online; I have their links in the reference section. Spend a few minutes there and it should not be hard to figure out the numbers you need. In recent years, model complexity has grown faster than GPU improvement, so we can view the choice from the price point rather than speed and memory. There are $500–$700, $1,100, and $2,400 ranges for the GPU. You can stay within a comfortable price range when determining the GPU you want. For example, the recommended GPU in the $1K range may have been the 1080TI a couple of years ago, while now it is the 2080TI for a similar price. The added capability of newer GPUs will not be wasted.

Training Case Study

Let's use a real problem for our study here. BERT trains a sequence of vector representations for a sequence of words.


Google positioned BERT as one of its most important updates to its search algorithms for understanding user queries. In fact, we can expect the use of BERT in many NLP applications, including question answering.

Pretraining a BERT-base model from scratch is expensive: it takes about 4 Cloud TPUs four days to learn how to convert a word sequence into a vector sequence. Happily, there are pre-trained models, so we don't need to deal with the pretraining. The second phase of the training uses transfer learning to train a model targeted at a specific NLP task, and it requires a labeled dataset to fine-tune the model. The fine-tuning is much shorter: within 1 hour on a single Cloud TPU (a single Cloud TPU contains 4 TPU chips). On a single high-end GPU, that may take a few hours.
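As a concrete example of the fine-tuning step, here is a minimal sketch using the Hugging Face transformers library (this is not the original BERT repository code; the two sentences and labels are placeholders, and a real run would loop over a labeled dataset for a few epochs):

    # Minimal BERT fine-tuning step for sentence classification.
    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    texts = ["a placeholder sentence", "another placeholder sentence"]   # your labeled data
    labels = torch.tensor([0, 1])

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch, labels=labels)   # forward pass returns the loss and logits
    outputs.loss.backward()
    optimizer.step()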

There are two pre-trained BERT models: BERT-Base (about 110M parameters) and BERT-Large (about 340M parameters).


The table below indicates an average improvement of 2.7% on GLUE (a collection of NLP tasks) when switching from BERT-Base to BERT-Large. In DL, this level of improvement is significant because errors accumulate. But the memory requirement will likely push you towards 16GB.

[Table: GLUE benchmark results for BERT-Base vs. BERT-Large]

Many trained models, like ResNet and Inception, are available for computer vision. Some of these models may be simpler than in NLP, but the transfer learning is usually more involved than with BERT. I also build custom models; a GAN model may take days in some of my later-stage experiments.

How much GPU memory?

I think 8GB is a good start and 11GB is nice. But knowing the minimum memory requirement is hard! It depends strongly on the models you want to train, the model versions, and the out-of-the-box code implementation. In Nvidia's implementation, with gradient accumulation and AMP, a single 16GB GPU can train BERT-large on 128-word sequences with an effective batch size of 256 by running a batch size of 8 with 32 accumulation steps.
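Gradient accumulation itself is only a few lines of training-loop logic. A generic PyTorch sketch with a toy stand-in model and data (AMP is omitted for clarity):

    # Simulate a large effective batch by accumulating gradients over small batches.
    import torch
    from torch import nn

    model = nn.Linear(10, 2)                     # stand-in model; replace with your own
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    loader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(64)]  # batch size 8

    accumulation_steps = 32                      # effective batch = 8 x 32 = 256
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        loss = loss_fn(model(inputs), targets) / accumulation_steps  # average over the effective batch
        loss.backward()                          # gradients accumulate in .grad across small batches
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()                     # one weight update per effective batch
            optimizer.zero_grad()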

But the hard part is not knowing how much memory you need to avoid running out of memory. Instead, it is knowing the minimum memory needed to achieve an accuracy similar to what a larger amount of memory would give. You may think this information is widely available for popular implementations, but in practice it is not. In the next two sections, we will spend some time addressing the memory issue.

Mixed Precision Training

When computers had only megabytes of RAM, the idea of doubling RAM by compressing memory was once considered. On the GPU, we do not virtually double the memory, but we can shrink the data size: instead of 32-bit floating-point operations, we use 16-bit math. This reduces the memory footprint and the memory load, and the half-precision calculations are faster as well.

In practice, the mixed-precision method improves performance 2–4x. The actual memory saving is less than half because some values still need to be stored and computed in 32-bit to preserve accuracy. Depending on the model, expect the memory improvement to be between 50% and 100%.

To learn more about the concept of mixed precision, we have a separate article. Luckily, with common DL platforms, the mixed-precision method can be applied to different DL models with one or a few lines of code, using the same batch size and hyperparameters. Even better, the trained model will have accuracy similar to 32-bit training.


Mixed Precision Purchase Decision

Why does mixed-precision impact your computer purchase?

The Nvidia Turing architecture uses Tensor Cores to accelerate mixed-precision training. We can view a Tensor Core as executing a special instruction that multiplies and adds 4×4 matrices in mixed precision. Large matrices can be broken down into tiles and computed in a pipeline. The Nvidia Tesla V100 comes with 640 Tensor Cores, i.e. it can compute 640 such calculations in parallel.
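Conceptually, the operation looks like the following NumPy sketch: FP16 tiles are multiplied and the products are accumulated in FP32. This is only an illustration of the idea, not how the hardware is actually programmed:

    # Conceptual mixed-precision tiled matrix multiply: FP16 inputs, FP32 accumulation.
    import numpy as np

    T = 4                                             # tile size, like a 4x4 Tensor Core op
    A = np.random.rand(16, 16).astype(np.float16)
    B = np.random.rand(16, 16).astype(np.float16)
    C = np.zeros((16, 16), dtype=np.float32)          # accumulator kept in full precision

    for i in range(0, 16, T):
        for j in range(0, 16, T):
            for k in range(0, 16, T):
                a = A[i:i+T, k:k+T].astype(np.float32)
                b = B[k:k+T, j:j+T].astype(np.float32)
                C[i:i+T, j:j+T] += a @ b              # each tile product is one "Tensor Core" step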


You may not want to implement mixed precision yourself; it is more complex than it looks. Fortunately, many DL platforms provide it under the hood through Nvidia AMP (Automatic Mixed Precision). The first snippet below is for TensorFlow and the second is for PyTorch using the APEX extension.

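Roughly, enabling AMP looked like this at the time (the exact calls depend on the framework and version; the model, optimizer, and loss in the PyTorch part are assumed to come from your own training code):

    # TensorFlow 1.14+: wrap the optimizer so the graph is rewritten for mixed precision.
    # (Inside the NGC container, setting TF_ENABLE_AUTO_MIXED_PRECISION=1 has a similar effect.)
    import tensorflow as tf
    opt = tf.train.AdamOptimizer(learning_rate=1e-3)
    opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)

    # PyTorch with APEX: patch the model and optimizer, then scale the loss in the training loop.
    from apex import amp
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")   # model/optimizer defined elsewhere
    with amp.scale_loss(loss, optimizer) as scaled_loss:                  # loss from your forward pass
        scaled_loss.backward()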

But you do need to choose the Volta or Turing architecture, for example a 20x0 GPU over a 10x0 GPU. Only these architectures have Tensor Cores.

Below is the potential mixed-precision speedup in TensorFlow for common DL models.

[Chart: mixed-precision speedups in TensorFlow for common DL models]

I spend a relatively large amount of discussion on mixed precision because I want you to be aware that your GPU may come with a turbo button, but you do need to push it. You will need the NGC TensorFlow container (NVIDIA GPU-Accelerated Containers) for TensorFlow, and the NGC PyTorch container plus the Apex library for PyTorch.

Now, let's get back to choosing the computer components.

Number of GPUs & PCIe Lanes

One good way to speed up training is to add GPUs to a machine, but there are a few issues you need to pay attention to. For dual and multiple GPUs, check with PC vendor websites to see whether you need a larger power supply (and, of course, a spare PCIe slot for the video card). In some situations, it may push you to the next tier of computer: the PC vendor may require an upgrade of the CPU and other components, which may eventually add an extra $2K excluding the new GPU card. This, however, depends on the GPUs. A PC vendor may offer dual 2080 GPUs with a mid-level gaming machine while it only offers dual 2080TI GPUs with a high-end gaming machine.

One of the most expensive upgrades is related to the CPU. People often suggest paying special attention to the number of PCIe lanes. A GPU card should utilize 16 or 8 PCIe lanes to the CPU (a.k.a. a 16-lane or 8-lane highway). Theoretically, if you have 2 GPU cards, you want 32 CPU PCIe lanes allocated to the 2 GPUs. This minimizes any unnecessary theoretical bottleneck.

I am personally bothered by this claim. We should not quantify performance impact without proper context; the key question is the average impact on a typical workload. Indeed, many benchmarks on gaming applications show only 1–2% performance degradation when 8 lanes per GPU are used, and DL benchmarks may show very similar degradation.
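For perspective, PCIe 3.0 delivers roughly 1 GB/s per lane in each direction, so even a halved link still has considerable headroom:

    # Approximate PCIe 3.0 link bandwidth per GPU (~0.985 GB/s per lane, each direction).
    gb_per_lane = 0.985
    for lanes in (16, 8):
        print(f"x{lanes}: ~{lanes * gb_per_lane:.1f} GB/s")   # x16: ~15.8 GB/s, x8: ~7.9 GB/s

Halving the lanes halves the theoretical ceiling, but it only matters when a workload actually saturates the link, which the benchmarks above suggest is rare.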

Following the 16-lanes-per-GPU advice comes with a price. Mid-level gaming machines usually come with a processor that has only 16 CPU PCIe lanes.


You can go with high-end Intel processors like the X-series, which have 44 lanes.


Some high-end AMD chips have 60 PCIe lanes. But both types of chips cost significantly more. In addition, even if you have PCIe x16 slots, they may not operate at x16 speed: you also need the motherboard and chipset designed to support x16 speed for that slot. Indeed, major PC vendors may only offer those capabilities in their top-end gaming machines, which start at a base price around $3K. As shown below in the Nvidia control panel, the GPU card's bus type is PCIe x16 but the PCIe link operates at x8 only.

[Screenshot: Nvidia control panel showing a PCIe x16 card running at x8]

When you put two 2080TIs into a machine, many computer vendors believe you need a more powerful CPU with more PCIe lanes. This is part of the reason for the sudden price jump for a dual-2080TI system.

Since I may contradict other people's recommendations, I will spend a little time explaining PCIe lanes in case you decide to go wider. First, your CPU lanes should be wide enough for your multiple GPUs. Don't confuse them with the PCIe lanes in the motherboard chipset: CPU lanes usually connect the CPU to the graphics cards and can sometimes bypass the CPU in accessing RAM. Other devices may connect to the PCIe lanes in the chipset, but all that traffic is funneled through DMI 3.0 with a maximum throughput of 3.93 GB/s. The diagram below shows the 44 CPU lanes on the top left for the Intel X-series processor and the 24 PCIe lanes from the Intel X299 chipset below it.

[Diagram: Intel X-series CPU PCIe lanes (44) and X299 chipset PCIe lanes (24)]

As mentioned, you also need the motherboard's PCIe slots to be designed for x16 speed. Many specifications show two x16 PCIe slots, but that does not mean they operate at x16 speed. Consult your PC vendor for clarification if in doubt.


Let's recap. Each GPU should take at least 8 CPU PCIe lanes; whether you need more is debatable. Performance scales close to linearly with 1–4 GPUs for most DL models. We can add NVLink between Nvidia GPUs to enable direct GPU-to-GPU communication. But be prepared to pay additional money, excluding the GPU cards, when jumping to 2 or 4 GPUs.

CPU & Memory

The CPU is mainly for data preprocessing and data augmentation during training. So get a reasonable processor, but expect less return for GPU-intensive applications. I would not get a top-of-the-line CPU; instead, I would reinvest the money in other bottlenecks, in particular the GPU. Check CPU benchmarks to justify any upgrade cost. Again, your targeted applications should be GPU-bound.

For memory, I start with 32GB, but make sure you have room to go up to 64GB later if needed. Memory compatibility and upgrades can be very tricky depending on your computer vendor, so do your homework carefully.

For 3–4 GPUs, you may need a more powerful CPU and more memory. But doubling the GPUs does not require doubling the CPU or memory; it may not even change the CPU and memory choice for a dual-GPU system, since many resources are shared within the same experiment. You likely have spare CPU capacity to handle dual GPUs. For more than 2 GPUs, AMD processors can be appealing because of their price and their number of CPU PCIe lanes.

Disk space

The ImageNet dataset is about 138GB in tar format and needs about 300GB to download and extract. The BERT dataset takes 600GB for download and data preparation. A 1TB M.2 PCIe NVMe SSD is a good start, with a second, cheaper 2–3 TB drive. If SSD prices drop, you may even want a larger SSD.

Power supply

A computer with a high-end GPU may be recommended by the GPU vendor to have a 650W power supply. (Please verify this information for your GPU.) This is one factor you need to be careful about with multiple GPUs. The configuration tool on the PC purchase website may warn you if you need more power, but if you purchase a second card separately, you need to check whether the power supply has enough headroom left for it.

Heat issues

Heat can be a significant factor in designing a system with 3 or more GPUs. GPU performance is sensitive to thermal and power limits: the GPU deliberately slows itself down when it overheats. With 3 or more GPUs stacked close together, there may be no gap between the cards.


A GPU with a blower cooler is recommended for this configuration, to push hot air sideways and out of the case. But a blower GPU is usually noisier and runs hotter. An open-air GPU is quieter and cooler, but it is not efficient when the cards are closely packed.

OS

For DL, Linux is usually the first platform for the software release. For a desktop machine, I will use Ubuntu. The installation process takes about 15 minutes, but be prepared for hours of troubleshooting.

Running Experiments with Multiple GPUs

People add identical cards to speed up the training of a single experiment. Depending on the DL platform, this can be done with little or no coding. But I also want a system that can run multiple experiments concurrently: one card for a long-running training job and another for shorter experiments. Some people also use multiple cards for hyperparameter search, with each experiment running on a separate card with different hyperparameters.
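Pinning each experiment to its own card is simple. One common approach is to restrict which GPU a process can see before any CUDA work starts; a minimal sketch:

    # Run this near the top of each experiment script, before any CUDA work starts.
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # use "1" in the second experiment's script

    import torch
    print(torch.cuda.device_count())           # this process now sees exactly one GPU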

Buy vs. Custom Build

If you build your own machine, you need to make sure the motherboard is designed for the specific speed you need on each PCIe slot, and that the power supply has enough power. You can literally design a "cooler" system. I give you the information from the DL perspective; it is better to carry these criteria to people who specialize in configuring custom-built machines. Specifically, first decide on your choice of GPUs, then consult them for the whole computer configuration built around those GPUs, along with the memory and storage recommendations discussed here. Making sure components are compatible takes time; PCPartPicker is a good start. If you need a highly optimized configuration, consulting the custom-build community is likely a better choice than asking DL people; I find they give better and more practical suggestions.

There is a compelling reason to do a custom build for multiple GPUs. When I configured a dual-GPU system, some vendors pushed the next tier of computer, which has little return for me, and I don't need an MS OS installed. The savings add up: some attribute 20% to 40% savings to a custom build, which I think is possible if you go with dual GPUs or more. However, because of time constraints, I give this option a lower preference.

My Choice

For my personal workload, I foresee a dual-2080TI system. The PC vendor leaves me no choice but to select a high-end AMD CPU with a 1,500W power supply. This configuration is about $2.2K plus $2.2K for the dual GPUs, and it will be my reference configuration. Now I have a $4.4K design, and I want to see how closely I can achieve the same design for less money.

After a couple of weeks of struggling, an opportunity opened up in which I could acquire one Titan RTX for free. With a lot of luck, my original design is now getting closer to the sub-$2,000 range. But I was also presented with another possible opportunity for two 2080TI cards. The dual GPUs can speed up training in a way the Titan RTX cannot match, but the Titan RTX card has a strong appeal for me: some of the DL models I work on, like BERT, need more GPU memory. (I will also discuss options later without the free GPUs.)

My two final candidate configurations are a dual-2080TI system and a Titan RTX. Now I have to weigh their relative merits for my personal projects. I sleep on it overnight and make my decision. It is a close call; finally, I pick the Titan RTX. To run multiple experiments, I add a 2070 GPU card with 8GB of memory. Ironically, this combination allows me to stay with a mid-level gaming computer, so it actually costs less out of my pocket in this scenario.

I pay $100 more to upgrade the CPU to an i7 9700, which has about a 30–40% benchmark improvement. The next upgrade costs $350 for less than 10% improvement, which I care less about. The actual CPU performance depends on many factors, including thermal throttling: for computers with overheating issues, the faster CPU clock speed will not materialize.

I do not upgrade the memory speed (it costs about $100). I could upgrade to a 2070 Super at no additional charge, but I prefer the second card to have lower power consumption to mitigate some potential issues.

And this is the final configuration.

[Image: final system configuration]

Just as a reference, this is the information on the PCIe lanes and the chipset.

[Images: PCIe lane and chipset details for the selected system]

Nevertheless, even though the selected computer has two PCIe x16 slots, the motherboard supports x8 speed only.

Including the cost of the RTX 2070 GPU but not the Titan RTX, the system costs $1,781. I got a huge break that feels surreal. But the lesson learned is to think outside the box. There are always special offers around. Check the secondary market; some of your friends may want to get rid of their older GPUs, or may have a spare one on the table and just want to pick your brain. By keeping looking and pounding the pavement, many possibilities may come up. But again, there is no free lunch: these efforts take time, and some professionals may prefer to value their time instead.

What are the risks?

There are calculated risks that I take in the process, so it is time to check whether I made any major mistakes. My first concern is overheating, so I stress-test both GPUs at the same time. The RTX 2070 stabilizes around 66°C, which I feel pretty comfortable with. The Titan RTX draws more power and stabilizes around 86°C. It is still below the slowdown temperature (91°C), but I will pay more attention to the Titan GPU in the future. In a further experiment, I take the load off the RTX 2070, and the Titan's temperature settles around 81°C.


Can I put enough load onto the GPUs without hitting other bottlenecks first? I ran two programs on the GPUs concurrently; the model running on the Titan RTX is more complex, to drive up its GPU utilization. The stress test reaches 96% utilization on the Titan RTX and 88% on the RTX 2070, and I know I can still easily add more load to the GPUs. So the preliminary result shows the system can be GPU-bound.
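To watch temperature and utilization during a stress test, a small polling loop around nvidia-smi is enough (assuming the NVIDIA driver utilities are installed):

    # Poll GPU temperature, utilization, and memory every 5 seconds.
    import subprocess
    import time

    query = ["nvidia-smi",
             "--query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used",
             "--format=csv,noheader"]
    for _ in range(12):                        # about one minute of samples
        print(subprocess.check_output(query, text=True).strip())
        time.sleep(5)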


Do I have a CPU bottleneck? CPU utilization is only around 25% (the monitoring tool reports per-core values, so divide by 8 for my 8-core CPU). So my selected CPU has room for more pre-processing.


2 months later & lessons learned

Two months later, I used the credits from this purchase to buy a 4TB external drive with an additional out-of-pocket cost of $50. For many DL training datasets, besides the space needed for the data itself, it takes a lot of space to prepare the data and to save it in the format that best fits the code. For example, BERT's implementation takes 600GB of space, with a big chunk going to saving the data in TFRecord format for faster processing. I can clean up and keep only the data I need, but this demonstrates how much space a single problem may take. If SSD prices were lower, I would definitely suggest getting a larger drive. I also tried to upgrade the memory to 64GB but ran into memory-compatibility issues. I had verified this beforehand, but due to unexpected vendor design issues, my first memory upgrade attempt failed. As always, perfection may not be possible in engineering; just make mistakes less costly. This unexpected issue will likely cost me an additional $100 on top of the new memory, but it is not a major problem even if I skip the upgrade.

A $1700 range machine

I hope the information here can narrow the choices down for you. But to be fair, let's work out the math for a system around $1,700. As of 2019, you can get a computer for about $1,700 with an Nvidia 2070 Super GPU with 8GB of GPU memory and 32GB of host RAM. That includes a 9th-gen i7 processor, a 1TB M.2 PCIe NVMe SSD, and a 1TB SATA drive. You can add $150 or $600 to upgrade to a 2080 Super or a 2080TI, which increases performance by roughly 50% or 100%, respectively. These GPUs will do a lot of great things for you. If more GPU power is needed, add more GPUs, but the cost grows non-linearly as you need a more powerful computer. If more GPU memory is needed because your training runs out of memory, you have to move up to, say, a Titan RTX.

However, computer components are likely to drop in price over time. As prices drop, you can stay with the same target budget but get a better GPU; the extra capacity will not be wasted. For example, a month and a half after my purchase, the same $1,700 could buy the system with a 2TB SSD and an Nvidia 2080 Super GPU.

Credits & References

GPU performance & cost comparison: Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning

The $1700 great Deep Learning box: Assembly, setup and benchmarks

Scalability benchmark: TensorFlow Performance with 1–4 GPUs — RTX Titan, 2080Ti, 2080, 2070, GTX 1660Ti, 1070, 1080Ti, and Titan V

Picking PC components: PCPartPicker

Mixed Precision Training

Mixed Precision Training of Deep Neural Networks

Training with Mixed Precision

Video Series: Mixed-Precision Training Techniques Using Tensor Cores for Deep Learning

NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch

Cloud TPU System Architecture

Google BERT implementation
