Deep Learning Software Setup with GPU

Open-source platforms rule deep learning (DL). However, setting up open-source software is often a “Plug and Pray” experience. Sometimes there is too little information. More often, we are overwhelmed with promotional material or with instructions that try to cover every minor possibility. This article gives you a jump-start on software setup: Ubuntu 18.04 installation, Nvidia drivers, CUDA, cuDNN, Nvidia GPU-Accelerated Containers (NGC), TensorFlow, PyTorch, Docker, and more. It should cover about 80% of what you need.

However, if you need to purchase a deep learning machine first, here is an article on it.

In this article, I try to provide simple instructions. The challenge is that every computer has a different hardware and software configuration, and people want different things. There are no universal instructions. Google your problems extensively when needed. It is unavoidable.

Ubuntu installation

  1. Let’s get an overview of the installation process here first (about 10 minutes).
  2. Download Ubuntu (note: this instruction uses the Desktop version). Select the LTS (Long Term Support) version. As of Nov. 2019, this will be Ubuntu 18.04.3 LTS.
  3. Follow the instructions here to create a bootable USB drive for Ubuntu.

During the installation, the original OS and its data are likely to be destroyed. Even when a setting change is reversible, the original OS may become non-bootable without reinstallation. Therefore,

Always create a recovery drive and an image backup for your original computer.

Get a 32GB USB drive for the recovery drive and an 8GB USB drive for the Ubuntu bootable drive. The image backup can be stored on an external drive.

So let’s start installing Ubuntu. Plug in your prepared bootable USB drive. Reboot the machine and hit F12 to enter the boot menu (the key depends on your PC vendor). Then, select the USB drive to boot the system.

Image for post

If you see the screen below, your prayer is answered.

Image for post

But if your computer has only an Nvidia GPU (without any other graphics processor, like Intel’s), your display may be scrambled because the display driver in the Ubuntu distribution does not work with new Nvidia GPUs. (The Nvidia driver is now included in Ubuntu 19.10, but 19.10 is not an LTS version and the driver is not enabled by default.)

Reboot the system and select the bootable USB drive again. But this time, press “e” right away. That leads you to a screen in which you can temporarily add a kernel boot parameter.

Image for post

Add the parameter nomodeset to disable the loading of the bad display driver. Hit F10 to continue the booting process. The computer will at least boot to a low-resolution GUI, and we can fix the driver after the installation. For demonstration purposes, though, I will use regular full-resolution screenshots here.

Next, you can follow the screens to install Ubuntu. In one of the screens, I also select “install third-party software …”

Image for post

I don’t need dual-boot with MS Windows, so I select “Erase disk”. Warning: this will erase everything in the hard drive.

Image for post

Pick the choices you like to complete the installation. “Something else” allows more advanced disk partitioning. You can Google the information easily (like here). Personally, the additional flexibility is not very important to me, and I can work around it with other methods.

Since Nvidia drivers are not installed yet, remember to add nomodeset again whenever you reboot your system.

Issues

Your installation experience and issues may vary depending on your hardware and configuration. I am going to describe a couple of hurdles that I faced. Be aware that they may not happen to you, and you may run into other issues instead. First, during the installation, the installer cannot see my SSD drive.

Image for post

I change my SATA configuration in BIOS from RAID ON to AHCI. WARNING: This setting makes your Windows non-bootable. For dual boot configuration, please Google “AHCI for SATA in BIOS without Reinstalling Windows” and perform the reconfiguration task first. However, it is not important for me because I decide to wipe out Windows anyway.

But in the process, I run into Secure Boot issues. The computer fails to boot with the following error.

Failed to open \EFI\BOOT\mmx64.efi - Not Found
Failed to load image \EFI\BOOT\mmx64.efi: Not Found
Failed to start MokManager: Not Found
Something has gone seriously wrong: import_mok_state() failed: Not Found

As I said, you will spend a lot of time Googling to figure out these one-off issues yourself. I eventually turn off Secure Boot in the BIOS. If you encounter the Secure Boot problem, I suggest studying the issue further before taking any action (thread, MS, Ref 2). There are alternatives that require different amounts of investigation. The Secure Boot issue happens more frequently than people may think.

I have to stress that turning off Secure Boot is a personal decision that I made after evaluating the security risk, and you may not have this problem in the first place. The choice is debatable, so please make your own judgment. Here are my changes (in the red rectangles) in the BIOS for Secure Boot and SATA Operation. These are the two extra issues that I encounter.

Image for post

After the installation is completed, we can fix the display driver. First, find out the version of the driver that you need. Boot the installed Ubuntu system with the kernel parameter nomodeset. Replace 440 below with the driver version you need.

sudo add-apt-repository ppa:graphics-drivers
sudo apt-get update
sudo apt-get install nvidia-driver-440
sudo reboot

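If an older Nvidia driver is already installed, people often purge it before installing the new one. Since I have a new system, I don’t need to. The pattern is quoted so that the shell passes it to apt unchanged.

sudo apt purge 'nvidia-*'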

A system restart is needed whenever the driver is changed. For future driver updates after Ubuntu is installed, you can also use the GUI demonstrated here.
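After the reboot, one way to confirm that the driver is loaded is to ask nvidia-smi for the GPU name and driver version. The sketch below is not part of the original steps. It assumes nvidia-smi is on the PATH and relies on its --query-gpu and --format=csv,noheader options; the helper names parse_gpu_csv and list_gpus are made up for this illustration.

```python
import shutil
import subprocess

# nvidia-smi options that print one CSV row per GPU, without a header line.
QUERY = ["nvidia-smi", "--query-gpu=name,driver_version,memory.total",
         "--format=csv,noheader"]


def parse_gpu_csv(text):
    """Turn 'name, driver, memory' CSV rows into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # skip anything that is not a 3-field row
        name, driver, memory = fields
        gpus.append({"name": name, "driver": driver, "memory": memory})
    return gpus


def list_gpus():
    """Return GPU info, or an empty list if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


gpus = list_gpus()
if not gpus:
    print("No GPU reported. Is the Nvidia driver installed and loaded?")
for gpu in gpus:
    print(gpu["name"], "| driver", gpu["driver"], "| memory", gpu["memory"])
```

If the reported driver version matches the one you installed (440 in this article), the display driver is in place.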

Now, you don’t need to set the nomodeset parameter anymore. Just practice “Plug-and-Pray” again :-). I have installed Ubuntu in 15 minutes but sometimes the whole process may take much longer with unexpected issues.

SSH configuration

It is nice to set up the SSH connection with key-based authentication and disable password login. Here are the instructions for CentOS using yum. For Ubuntu, use apt-get instead (as shown below).

sudo apt-get install openssh-server
sudo cp /etc/ssh/sshd_config /etc/ssh/sshd_config.factory-defaults

To restart the SSH service in the GPU machine:

sudo systemctl restart ssh
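Disabling password login usually comes down to setting PasswordAuthentication no in /etc/ssh/sshd_config before restarting the service. The sketch below is one hedged way to double-check the file after editing it. It is a simplification: it only handles whitespace-separated "Keyword value" lines, stops at the first Match block, and does not follow Include files. The function names are hypothetical.

```python
def effective_setting(config_text, keyword):
    """Return the first value set for keyword, or None if it is never set.

    sshd uses the first value it obtains for each keyword, and keywords are
    case-insensitive. Match blocks and Include files are not handled here.
    """
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        parts = line.split(None, 1)
        if parts[0].lower() == "match":
            break  # settings after this point are conditional
        if len(parts) == 2 and parts[0].lower() == keyword.lower():
            return parts[1].strip()
    return None


def password_login_disabled(config_text):
    """True only when PasswordAuthentication is explicitly set to no."""
    value = effective_setting(config_text, "PasswordAuthentication")
    return value is not None and value.lower() == "no"
```

For example, password_login_disabled(open("/etc/ssh/sshd_config").read()) should return True once password login has been turned off. Keep an existing SSH session open while testing so that a mistake does not lock you out.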

Package installation

Let’s first install some general packages that many other installations need.

sudo apt-get install build-essential
sudo apt-get install freeglut3 freeglut3-dev libxi-dev libxmu-dev
sudo apt install curl

For monitoring tools (like iostat), install:

sudo apt install sysstat

Install the CUDA toolkit

The Nvidia CUDA toolkit provides a development environment for creating GPU-accelerated applications. Deep learning (DL) platforms use it to speed up operations, so it needs to be installed to use the GPU.

Go here and select options like the ones below.

Image for post

This will generate the instructions above for you to run. Below is the one in the text form.

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb
sudo apt-key add /var/cuda-repo-10-2-local-10.2.89-440.33.01/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda

If you run into dependency issues with the last command when installing CUDA, you can use aptitude instead. It handles dependency issues better.

sudo apt-get install aptitude
sudo aptitude install cuda

In our example, CUDA 10.2 will be installed and you can set the following into your environment.

export PATH=$PATH:/usr/local/cuda/bin
export CUDADIR=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64

Nvidia cuDNN

cuDNN is an Nvidia GPU-accelerated library for DL. Download the three packages needed for Ubuntu from here. Then install them, using the names of the packages you downloaded.

sudo dpkg -i libcudnn7_7.6.5.32-1+cuda10.2_amd64.deb
sudo dpkg -i libcudnn7-dev_7.6.5.32-1+cuda10.2_amd64.deb
sudo dpkg -i libcudnn7-doc_7.6.5.32-1+cuda10.2_amd64.deb

Your new environment settings should be set to:

export PATH=$PATH:/usr/local/cuda/bin
export CUDADIR=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
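A forgotten export is a common reason for tools such as nvcc not being found later. The sketch below is one hedged way to check that the current shell actually carries the entries listed above. It assumes the /usr/local/cuda prefix used in this article, and the helper name missing_cuda_paths is made up for this illustration.

```python
import os

CUDA_HOME = "/usr/local/cuda"  # install prefix used in this article


def missing_cuda_paths(env, cuda_home=CUDA_HOME):
    """Return (variable, path) pairs that are expected but absent from env."""
    expected = [
        ("PATH", cuda_home + "/bin"),
        ("LD_LIBRARY_PATH", cuda_home + "/lib64"),
        ("LD_LIBRARY_PATH", cuda_home + "/extras/CUPTI/lib64"),
    ]
    missing = []
    for var, path in expected:
        # Split the colon-separated list and ignore trailing slashes.
        present = [p.rstrip("/") for p in env.get(var, "").split(":") if p]
        if path not in present:
            missing.append((var, path))
    return missing


problems = missing_cuda_paths(os.environ)
for var, path in problems:
    print("%s is missing %s" % (var, path))
if not problems:
    print("CUDA paths look fine")
```

If entries are reported as missing in a new terminal, the export lines probably have not been added to a startup file such as ~/.bashrc.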

Follow the instructions below to test the installations. If you can build and pass the MNIST application below, your system is ready for CUDA and cuDNN.

Image for post
Source

NGC (Nvidia GPU-Accelerated Containers)

Next, we set up the environment for NGC (Nvidia GPU-Accelerated Containers). These are Docker containers in which you can run the goodies from Nvidia, including Nvidia AMP (Automatic Mixed Precision).

A Volta or Turing architecture GPU (20x0 GPUs, RTX Titan, V100, …) is required for NGC.

Skip the NGC related installation sections here if you don’t have the supported GPUs.

Docker installation

First, we need to install Docker. Here is the instruction and we will install it with apt-get.

To test it, download the “hello-world” Docker image and run a container.

docker run hello-world

But you may run into the root access problem.

docker run hello-world
docker: Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.38/containers/create: dial unix /var/run/docker.sock: connect: permission denied.
See 'docker run --help'.

To fix that

sudo usermod -aG docker $USER 
sudo chown root:docker /var/run/docker.sock

Then log out and log in again.

Docker instruction (optional)

Since we may run Docker images quite frequently, let’s have a quick summary of the Docker commands.

“docker run hello-world” loads the hello-world image, runs it, and then exits.

docker run hello-world

Even after the program exits, the container’s resources, including its log files, remain so we can examine them later.

List the Docker containers including running and exited.

docker ps -a
Image for post

To reattach and rerun a stopped container named myst_kilby.

docker start --attach myst_kilby # Rerun a hello-world container

The command below downloads the Ubuntu image and starts a bash shell as root in the container. The “-it” flag opens an interactive terminal so we can interact with it.

docker run -it ubuntu bash

To stop a container that is still running:

docker stop fervent_greider  # fervent_greider is the container name

Start an Nginx Web server:

docker run -v ~/my_html:/usr/share/nginx/html:ro -p 8080:80 -d nginx

“-v” maps a host directory to a directory in the container. In the example above, we can store the web pages in ~/my_html and the web server container will see them as /usr/share/nginx/html (read-only, because of “:ro”). “-d” runs the container in the background as a daemon. “-p 8080:80” lets us access the server on host port 8080, which maps to port 80 in the container.

To remove the container named fervent_greider and clean up the resources.

docker rm fervent_greider

We can use “--rm” to instruct Docker to remove the container after it exits.

docker run --rm hello-world

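To remove all stopped containers and their resources, along with unused networks and images (the “-a” flag also removes every image that no container is using):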

docker system prune -a
Image for post

NVIDIA Docker

The Nvidia Container Toolkit builds and runs GPU-accelerated containers such as NGC. It is an extension to Docker. The instructions below originate from here.

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Again, we can test the installation by:

# Test nvidia-smi with the latest official CUDA image
$ docker run --gpus all nvidia/cuda:9.0-base nvidia-smi
# Start a GPU enabled container on two GPUs
$ docker run --gpus 2 nvidia/cuda:9.0-base nvidia-smi
# Starting a GPU enabled container on specific GPUs
$ docker run --gpus '"device=1,2"' nvidia/cuda:9.0-base nvidia-smi
$ docker run --gpus '"device=UUID-ABCDEF,1"' nvidia/cuda:9.0-base nvidia-smi
# Specifying a capability (graphics, compute, ...) for my container
# Note this is rarely if ever used this way
$ docker run --gpus all,capabilities=utility nvidia/cuda:9.0-base nvidia-smi

The “gpus” flag indicates which GPUs to use. Here is the output demonstrating with the 2 GPUs I have.

Image for post

NGC for TensorFlow

Finally, we are going to pull the NGC Docker image. Here, we can find the NGC for TensorFlow version number in the link and the pull command to pull the Docker image. Run this command in a terminal.

Image for post

Replace tf1 with tf2 above for the TensorFlow 2.x version. We can test the installation with the command below, but replace tensorflow:19.12-tf2-py3 with your Docker image.

docker run --gpus all -it --rm nvcr.io/nvidia/tensorflow:19.12-tf2-py3

Here is the output:

Image for post

To run a TensorFlow application, you can use:

# cd to the directory containing myapp.py
docker run --gpus all -it --rm -v $PWD:/workspace nvcr.io/nvidia/tensorflow:19.12-tf2-py3 python myapp.py

Or simply type the command below for an interactive shell.

docker run --gpus all -it --rm -v $PWD:/workspace nvcr.io/nvidia/tensorflow:19.12-tf2-py3

Then, type any commands you want, including python myapp.py.

The flag

--gpus all

makes all GPUs available to the application. However, some platforms like TensorFlow use only one of the GPUs unless you add a couple of lines of extra code. To use a specific GPU, use

--gpus '"device=1"'

with the ID of the GPU you want.
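The nested quotes in the device form are easy to get wrong when the command is launched from a script rather than typed in a shell. The sketch below is a hedged illustration of building the same docker run arguments as a list for Python's subprocess module. The helper names gpus_flag and docker_run_args are hypothetical, and the inner double quotes are kept as literal characters because Docker parses the option value as comma-separated fields.

```python
def gpus_flag(devices=None):
    """Build the value for docker's --gpus option.

    None selects every GPU. A list of device indices or UUIDs becomes
    "device=..." wrapped in literal double quotes, so that the commas inside
    the device list are not read as field separators.
    """
    if devices is None:
        return "all"
    return '"device=%s"' % ",".join(str(d) for d in devices)


def docker_run_args(image, devices=None, command=()):
    """Return an argv list for subprocess.run (no shell quoting needed)."""
    return ["docker", "run", "--gpus", gpus_flag(devices),
            "-it", "--rm", image] + list(command)


# Same as: docker run --gpus '"device=1"' -it --rm <image> python myapp.py
print(docker_run_args("nvcr.io/nvidia/tensorflow:19.12-tf2-py3",
                      devices=[1], command=["python", "myapp.py"]))
```

Passing the resulting list to subprocess.run starts the container. In a shell, the outer single quotes play the same role of protecting the inner double quotes.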

NGC for PyTorch

There is also an NGC Docker image for PyTorch.

Image for post

Again, we can download and test the image with:

docker pull nvcr.io/nvidia/pytorch:19.10-py3
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:19.10-py3

Install Python & pip

While we can run DL applications with Docker, there are other ways to set up your own environment. Let’s prepare our system first.

sudo apt update
sudo apt install python3-dev python3-pip python3-testresources
sudo pip3 install -U virtualenv

Here is the information on the Python version you need. We will use Python 3.7 for now. (Personally, I have not had much trouble even with a later Python version.) But let’s keep the changes to a minimum and stay with the suggested one.

sudo add-apt-repository ppa:deadsnakes/python-3.7
sudo apt-get update
sudo apt-get install python3.7

You can check the installation by:

python3.7 --version

Note that python3 still points to 3.6.

python3 --version
Python 3.6.8

We will keep it this way. Changing it to 3.7 will break gnome-terminal in Ubuntu 18.04. Even though there are simple workarounds, the side effects of changing python3 can be problematic. The general advice is not to interfere with the system python3 or the system pip. Create a local environment instead. To invoke 3.7, we will use python3.7. (We can also create a symbolic link in /usr/bin, say pythontf. If you want your scripts to be forward compatible, use pythontf instead.)

TensorFlow using Virtual environment

Next, we will create and activate a virtual environment. The first instruction below creates a local environment in ./venv. This creates a separate environment with its own package stored under “venv”.

virtualenv --system-site-packages -p python3.7 ./venv
source ./venv/bin/activate

Inside the virtual environment, we can upgrade pip without impacting the global one.

pip install --upgrade pip

The final step installs the GPU version of TensorFlow.

pip install --upgrade tensorflow-gpu

Let’s test it out.

python -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
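The one-liner above confirms that TensorFlow imports and computes, but not that it found the GPU. The sketch below is a hedged follow-up check that assumes TensorFlow 2.x. It uses tf.config.list_physical_devices, falling back to the tf.config.experimental variant that TensorFlow 2.0 provides; the function name tensorflow_gpu_report is made up for this illustration.

```python
def tensorflow_gpu_report():
    """Describe the installed TensorFlow and how many GPUs it can see."""
    try:
        import tensorflow as tf
    except ImportError:
        return "TensorFlow is not installed in this environment"
    # list_physical_devices left tf.config.experimental in TensorFlow 2.1.
    lister = getattr(tf.config, "list_physical_devices", None)
    if lister is None:
        lister = tf.config.experimental.list_physical_devices
    gpus = lister("GPU")
    return "TensorFlow %s sees %d GPU(s)" % (tf.__version__, len(gpus))


print(tensorflow_gpu_report())
```

If the count is zero on a machine with a working driver, the CUDA and cuDNN versions that this TensorFlow release expects are the first thing to compare against what is installed.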

If this works, you achieve the biggest milestone in this article. Congratulations!

To exit the virtual environment:

deactivate

To get back to the virtual environment

source ./venv/bin/activate

And we can run a TensorFlow application again.

If you don’t use the virtual environment, make sure you use and upgrade a local python pip instead of the system pip. Other applications in the computer system may break because of your pip upgrade.

So if you are NOT running under a virtual environment (that is, you have not run source ./venv/bin/activate), don’t use the commands below. And if you find yourself needing sudo pip install, you are likely on the wrong path: it runs pip as root and changes the system-wide pip.

pip install --upgrade pip
pip install --upgrade tensorflow-gpu

Instead, we should perform

python3.7 -m pip install --upgrade pip
python3.7 -m pip install --upgrade tensorflow-gpu

This runs pip as a module under python3.7, and packages will now be installed in:

~/.local/lib/python3.7/site-packages
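Before running any pip command, it can help to confirm which interpreter is active and whether it belongs to a virtual environment. The sketch below is a hedged illustration using only the standard library. The function name in_virtual_environment is hypothetical, and the check covers both older virtualenv releases (which set sys.real_prefix) and venv or newer virtualenv releases (where sys.prefix differs from sys.base_prefix).

```python
import sys


def in_virtual_environment():
    """True when the running interpreter belongs to a virtualenv or venv."""
    if hasattr(sys, "real_prefix"):
        return True  # set by older virtualenv releases
    # venv and newer virtualenv releases change sys.prefix only.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("python %d.%d at %s" % (sys.version_info[0], sys.version_info[1],
                              sys.executable))
print("virtual environment:", in_virtual_environment())
```

Run it with the same interpreter you plan to use for pip, for example python3.7 check_env.py, where check_env.py is whatever file name you saved it under.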

TensorFlow with Docker

In this section, we will look into running the TensorFlow Docker image instead of NGC. Here, we pull the TensorFlow image. Then we execute an inline program.

docker pull tensorflow/tensorflow                 
docker run -it --rm tensorflow/tensorflow python -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

Or we can put the program in the test.py and run

docker run -it --rm -v $PWD:/tmp -w /tmp tensorflow/tensorflow python ./test.py

-v maps the host directory to a container directory. -w specifies the working directory as /tmp in the container.

-v $PWD:/tmp -w /tmp

Now, we can save test.py in the current host directory ($PWD). In the container, it becomes /tmp/test.py. Since the working directory is /tmp, we can address test.py as ./test.py in the command line.
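The mapping described above is a plain prefix substitution, which the following sketch spells out. It is only an illustration with a hypothetical helper name, container_path, and a made-up host directory. It assumes POSIX-style paths on both sides.

```python
from pathlib import PurePosixPath


def container_path(host_path, host_dir, container_dir):
    """Translate a host path under host_dir to its path inside the container.

    Mirrors "-v host_dir:container_dir". Raises ValueError when host_path is
    not located under host_dir, because such a file is not visible inside.
    """
    relative = PurePosixPath(host_path).relative_to(host_dir)
    return str(PurePosixPath(container_dir) / relative)


# With -v $PWD:/tmp and $PWD being /home/me/project (a made-up directory):
print(container_path("/home/me/project/test.py", "/home/me/project", "/tmp"))
```

Files outside the mounted directory have no counterpart in the container, which is why the script has to live under $PWD for this command to work.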

To run a Jupyter Notebook:

docker run -it -p 8888:8888 tensorflow/tensorflow:nightly-py3-jupyter

The output will include a URL, like the one below, that you can use to access the notebook.

http://127.0.0.1:8888/?token=12ecf21846a0644a30f52425eced6cdc651b825da2c676057

To run a docker image with GPU:

docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu \
python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

The flag “--gpus all” indicates using all GPUs.

PyTorch with Anaconda

Anaconda

The preferred package manager for PyTorch is Anaconda. Go here and find the link address for its installer.

Image for post

Copy the link address above to the command below and run the downloaded installer.

curl -O https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh
bash Anaconda3-2019.10-Linux-x86_64.sh

Test the installation

conda --version

Log out and log in again from the terminal. I prefer that conda’s base environment not be activated on startup, so I run

conda config --set auto_activate_base false

PyTorch

Go here to configure the command to install PyTorch with conda.

Image for post

This will install the pytorch and other packages onto the current environment (which is base now).

conda install pytorch torchvision cudatoolkit=10.1 -c pytorch

Let’s run a PyTorch application:

Image for post

In the second step, store the PyTorch code in test2.py.

conda activate
(base) jhui@r9:~$ vi test2.py
(base) jhui@r9:~$ python test2.py

Here is the code. Run it with python test2.py.

import torch
x = torch.rand(5, 3)
print(x)
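The test above runs on the CPU, so it passes even when PyTorch cannot reach the GPU. The sketch below is a hedged extra check built on torch.cuda.is_available, torch.cuda.device_count and torch.cuda.get_device_name. The function name torch_gpu_report is made up for this illustration.

```python
def torch_gpu_report():
    """Describe the installed PyTorch and the CUDA devices it can use."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "PyTorch %s: CUDA is not available" % torch.__version__
    names = [torch.cuda.get_device_name(i)
             for i in range(torch.cuda.device_count())]
    return "PyTorch %s sees: %s" % (torch.__version__, ", ".join(names))


print(torch_gpu_report())
```

Run it inside the same conda environment where PyTorch was installed, since another environment may not have the package at all.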

As demonstrated below, the python within the conda environment is 3.7.4 while the Ubuntu environment is 2.7.

Image for post

conda

Again, let’s quickly summarize some of the most basic conda commands.

Update conda:

conda update conda

Create a new environment called snowflakes and install the package biopython.

conda create --name snowflakes biopython

Let’s activate snowflakes.

conda activate snowflakes

To deactivate:

conda deactivate

Or switch back to the base:

conda activate

List all the environments. The default will have a “*” next to it.

conda info --envs
Image for post

pip

Again, we can use the tool to generate a command that installs PyTorch with pip instead of conda.

Image for post

Nevertheless, similar to TensorFlow, use this command only within a virtual environment.

pip3 install torch torchvision

Otherwise,

python3.7 -m pip install --upgrade torch torchvision

Github (optional)

Now, let’s connect the GPU machine to GitHub. These instructions are for those with private repositories. If you only access public repositories, you can skip this section. First, we create an RSA key pair (a public and a private key). On our GPU machine:

cd ~/.ssh
# -f names the key file explicitly (otherwise ssh-keygen prompts for it)
ssh-keygen -t rsa -b 4096 -C "name@mydomain.com" -f ~/.ssh/git_id_rsa
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/git_id_rsa

Log in to your GitHub account and select settings.

Image for post

Select SSH and GPG keys and add a new SSH key.

Image for post

Copy the content in the following file and paste it as the new SSH key.

.ssh/git_id_rsa.pub

Back on the GPU machine, we can test the connection:

Image for post

To clone a project:

git clone git@github.com:jhui/project.git

Moving Forward

I was very reluctant to write this article. The variety of software and hardware configurations makes universal instructions extremely hard to write. Different steps lead a machine to different states. Without tracing what you did, you are unlikely to solve your problems. Google your error message extensively. If you cannot find any information that gives you hints within an hour, it is likely that your system state or configuration is very different. You may want to restart from a better, known state first. That being said, because of the complexity, I prefer not to address individual troubleshooting for this article. I know your pain, but I found it too hard to support it this way.

When you are not familiar with the setup process, keep things simple and do not ask for perfection. Refining the process iteratively will get you to the target much faster. If you find a solution that can help people, please leave a note in the response.

Another challenge is keeping this information up to date. For example, the TensorFlow API changes very fast and is often not backward compatible. If you find outdated information, please list the old description and state what the new one should be. That will help me a lot in tracking the changes.

If you know a better way to do things, let me know also. Keep things simple. I want to apply the 80/20 rule: cover the important but not every angle. Be nice to other people’s comments. I kick rude people out. (but just one person this year, 😇)

Credits & References

NGC Container User Guide
