Open-source platforms rule deep learning (DL). However, setting up open-source software is a “Plug and Pray” experience. Sometimes there is too little information. More often, we are overwhelmed with promotional material or with information that tries to cover every minor possibility. This article gives you a jump-start on software setup, covering Ubuntu 18.04 installation, Nvidia drivers, CUDA, cuDNN, Nvidia GPU-Accelerated Containers (NGC), TensorFlow, PyTorch, Docker, and more. This should cover 80% of what you need.
However, if you need to purchase a deep learning machine first, here is an article on it.
Buy a Deep Learning Computer — David v.s. Goliath
In this article, I try to provide simple instructions. The challenge is that every computer has a different hardware and software configuration, and people want different things. There are no universal instructions. Google your problems extensively when needed. It is unavoidable.
- Let’s get an overview of the installation process here first (about 10 minutes).
- Download Ubuntu (note: these instructions use the Desktop version). Select the LTS (Long Term Support) version. As of Nov. 2019, this is Ubuntu 18.04.3 LTS.
- Follow the instructions here to create a bootable USB drive for Ubuntu.
During the installation, the original OS and its data are likely to be destroyed. Even if the setting changes are reversible, the original OS may no longer boot without reinstallation. Therefore, prepare a recovery disc and an image backup of the original system first.
Get a 32GB USB drive for the recovery disc and an 8GB USB drive for the Ubuntu bootable drive. The image backup can be stored on an external drive.
So let’s start installing Ubuntu. Plug in your prepared bootable USB drive. Reboot the machine and press F12 to enter the boot menu (this key depends on your PC vendor). Then, select the USB drive to boot the system.
If you see the screen below, your prayer is answered.
But if your computer has only an Nvidia GPU (without any other graphics processor, such as one from Intel), your display will be scrambled because the display driver in the Ubuntu distribution does not work with new Nvidia GPUs. (The Nvidia driver is included in Ubuntu 19.10, but 19.10 is not an LTS version and the driver is not enabled by default.)
Reboot the system and select the bootable USB drive again. But this time, press “e” immediately. That leads you to a screen where you can temporarily add a kernel boot parameter.
Add the parameter nomodeset to disable the loading of the bad display driver. Press F10 to continue the boot process. The computer will at least boot to a low-resolution GUI, and we can fix the driver after the installation. For demonstration purposes, I use the regular screenshots here.
Next, follow the screens to install Ubuntu. In one of the screens, I also select “install third-party software …”
I don’t need dual-boot with MS Windows, so I select “Erase disk”. Warning: this will erase everything on the hard drive.
Pick the choices you like to complete the installation. “Something else” allows more advanced disk partitioning. You can Google the information easily (like here). Personally, the additional flexibility is not very important to me, and I can make up for it with other methods.
Since Nvidia drivers are not installed yet, remember to add nomodeset again whenever you reboot your system.
Your installation experience and issues may vary depending on your hardware and configuration. I am going to tell you about a couple of hurdles that I faced. But be aware that they may not happen to you, or you may have other issues. In my case, the installer could not see my SSD drive.
I changed my SATA configuration in the BIOS from RAID ON to AHCI. WARNING: This setting makes your Windows non-bootable. For a dual-boot configuration, please Google “AHCI for SATA in BIOS without Reinstalling Windows” and perform the reconfiguration task first. However, it was not important for me because I had decided to wipe out Windows anyway.
But in the process, I ran into Secure Boot issues. The computer would not boot and showed this error:
Failed to open \EFI\BOOT\mmx64.efi - Not Found
Failed to load image \EFI\BOOT\mmx64.efi: Not Found
Failed to start MokManager: Not Found
Something has gone seriously wrong: import_mok_state() failed: Not Found
As I said, you will spend a lot of time Googling to figure out these one-off issues yourself. I eventually turned off Secure Boot in the BIOS. If you encounter the Secure Boot problem, I suggest studying the issue more before taking any action (thread, MS, Ref 2). There are alternatives that require different amounts of investigation. And the Secure Boot issue happens more frequently than people may think.
I have to stress that turning off Secure Boot was a personal decision made after evaluating the security risk, and you may not even have this problem in the first place. This choice is debatable, so please make your own judgment. Here are my changes (in the red rectangles) in the BIOS for Secure Boot and SATA Operation. These are the two extra issues that I encountered.
After the installation is completed, we can fix the display driver. First, find out the version of the driver that you need. Boot the installed Ubuntu system with the kernel parameter nomodeset. Replace 440 below with the driver version you need.
sudo add-apt-repository ppa:graphics-drivers
sudo apt-get update
sudo apt-get install nvidia-driver-440
Some people purge their previously installed Nvidia drivers first. Since I have a new system, I don’t need to.
sudo apt purge 'nvidia-*'
A system restart is needed whenever the driver is changed. For future driver updates after Ubuntu is installed, you can also use the GUI demonstrated here.
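After the restart, it can be handy to confirm which driver version is actually loaded. Below is a minimal sketch, assuming nvidia-smi is on the PATH once the driver is installed. The helper names and the GPU model in the sample are made up for illustration.

```python
import shutil
import subprocess


def parse_gpu_rows(csv_text):
    """Turn 'name, driver_version' CSV rows into (name, driver) tuples."""
    rows = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so a comma inside the GPU name is harmless.
        name, _, driver = line.rpartition(",")
        rows.append((name.strip(), driver.strip()))
    return rows


def installed_driver():
    """Ask nvidia-smi for each GPU's name and driver; [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version",
         "--format=csv,noheader"],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True)
    if result.returncode != 0:
        return []
    return parse_gpu_rows(result.stdout)


# Hypothetical sample of what nvidia-smi prints for one GPU.
sample = "GeForce RTX 2080 Ti, 440.33.01\n"
print(parse_gpu_rows(sample))
```

Calling installed_driver() on the GPU machine returns one tuple per GPU, and an empty list means the driver (or at least nvidia-smi) is not working yet.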
Now, you don’t need to set the nomodeset parameter anymore. Just practice “Plug-and-Pray” again :-). I have installed Ubuntu in 15 minutes but sometimes the whole process may take much longer with unexpected issues.
It is nice to set up SSH access with key-based authentication and disable password login. Here are the instructions for CentOS using yum. For Ubuntu, use apt-get instead (as shown below).
sudo apt-get install openssh-server
sudo cp /etc/ssh/sshd_config /etc/ssh/sshd_config.factory-defaults
To restart the SSH service in the GPU machine:
sudo systemctl restart ssh
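Before restarting, you may want to double-check that password login is really off. Here is a rough sketch that reads the global part of an sshd_config. It relies on two facts about sshd: the first value found for a keyword wins, and PasswordAuthentication defaults to yes. The function names are my own, and the parser is simplified (it ignores the keyword=value form and anything after the first Match block).

```python
def effective_sshd_options(config_text):
    """Map each keyword (lowercased) to its first value, the one sshd uses."""
    options = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        keyword, value = parts[0].lower(), parts[1].strip()
        if keyword == "match":
            break  # only global settings are read in this sketch
        options.setdefault(keyword, value)
    return options


def password_login_disabled(config_text):
    """True only if PasswordAuthentication is explicitly set to no."""
    opts = effective_sshd_options(config_text)
    # A missing line counts as enabled because the default is "yes".
    return opts.get("passwordauthentication", "yes").lower() == "no"


sample = "# example\nPasswordAuthentication no\nPubkeyAuthentication yes\n"
print(password_login_disabled(sample))
```

To check the real file, pass it the text of /etc/ssh/sshd_config, for example with open("/etc/ssh/sshd_config").read().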
Let’s first install some general packages that many other installations need.
sudo apt-get install build-essential
sudo apt-get install freeglut3 freeglut3-dev libxi-dev libxmu-dev
sudo apt install curl
For monitoring tools (like iostat), install:
sudo apt install sysstat
Install the CUDA toolkit
The Nvidia CUDA toolkit provides a development environment for creating GPU-accelerated applications. Deep learning (DL) platforms use it to speed up operations, so it needs to be installed to use the GPU.
Go here and select options like the ones below.
This generates the installation instructions for you to run. Here they are in text form.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb
sudo apt-key add /var/cuda-repo-10-2-local-10.2.89-440.33.01/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda
If you hit dependency issues in the last command when installing CUDA, you can use aptitude instead. It handles the dependency issues better in this case.
sudo apt-get install aptitude
sudo aptitude install cuda
In our example, CUDA 10.2 is installed, and you can add the following to your environment.
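As a rough guide to what such settings usually look like, here is a small sketch that prints the two export lines commonly used after a CUDA install. It assumes the toolkit went to the default prefix /usr/local/cuda-10.2, so adjust the version and prefix if yours differ. The helper name is made up.

```python
def cuda_exports(version="10.2", prefix="/usr/local"):
    """Build shell export lines for a CUDA install under prefix/cuda-version."""
    home = prefix + "/cuda-" + version  # assumed default install location
    return [
        # ${VAR:+:${VAR}} appends the old value only if it was already set.
        "export PATH=" + home + "/bin${PATH:+:${PATH}}",
        "export LD_LIBRARY_PATH=" + home
        + "/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}",
    ]


for line in cuda_exports():
    print(line)
```

The printed lines can be appended to ~/.bashrc so that every new shell picks them up.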
cuDNN is Nvidia's GPU-accelerated library for DL. Download the three packages needed for Ubuntu from here. Then install them, using the names of the packages you downloaded.
sudo dpkg -i libcudnn7_7.6.5.32-1+cuda10.2_amd64.deb
sudo dpkg -i libcudnn7-dev_7.6.5.32-1+cuda10.2_amd64.deb
sudo dpkg -i libcudnn7-doc_7.6.5.32-1+cuda10.2_amd64.deb
Your new environment settings should be set to:
Follow the instructions below to test the installations. If you can build and pass the MNIST application below, your system is ready for CUDA and cuDNN.
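As a lighter sanity check before building anything, we can ask nvcc which CUDA release it belongs to. This is only a sketch: it assumes nvcc is on the PATH (which depends on your environment settings), and the function names are mine.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract (major, minor) from the 'release X.Y' part of nvcc's output."""
    match = re.search(r"release\s+(\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_cuda_release():
    """Run 'nvcc --version' and parse it; None if nvcc cannot be found."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(
        ["nvcc", "--version"],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        universal_newlines=True)
    return parse_nvcc_release(result.stdout)


# The last line nvcc prints for CUDA 10.2 looks like this.
print(parse_nvcc_release("Cuda compilation tools, release 10.2, V10.2.89"))
```

If installed_cuda_release() returns None on the GPU machine, the PATH setting is the first thing to look at.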
NGC (Nvidia GPU-Accelerated Containers)
Next, we set up the environment for NGC (Nvidia GPU-Accelerated Containers). These are the Docker containers in which you can run the goodies from Nvidia, including Nvidia AMP (Automatic Mixed Precision).
A Volta or Turing architecture GPU (20x0 GPUs, RTX Titan, V100, …) is required for NGC.
Skip the NGC-related installation sections here if you don’t have a supported GPU.
First, we need to install Docker. Here are the instructions, and we will install it with apt-get.
To test it, download the “hello-world” Docker image and run a container.
docker run hello-world
But you may run into a root access problem.
docker run hello-world
docker: Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.38/containers/create: dial unix /var/run/docker.sock: connect: permission denied.
See 'docker run --help'.
To fix that:
sudo usermod -aG docker $USER
sudo chown root:docker /var/run/docker.sock
Then log out and log in again.
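Why the logout? usermod updates the group database, but a session that is already running keeps its old group list. The sketch below shows the difference between the two views. It is Linux-only, the function name is made up, and the "listed" check only looks at supplementary members of the group.

```python
import getpass
import grp
import os


def group_status(group="docker"):
    """Report whether a group exists, lists the user, and is active now."""
    try:
        entry = grp.getgrnam(group)
    except KeyError:
        return {"exists": False, "listed": False, "active": False}
    try:
        user = getpass.getuser()
    except Exception:
        user = None  # no usable login name in this session
    return {
        "exists": True,
        # True once usermod has added the user to the group database.
        "listed": user in entry.gr_mem,
        # True only after a fresh login has picked up the new group.
        "active": entry.gr_gid in os.getgroups(),
    }


print(group_status("docker"))
```

If "listed" is True but "active" is False, the change is recorded and only the new login is missing.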
Docker instructions (optional)
Since we will run Docker images quite frequently, let’s have a quick summary of the commands.
“docker run hello-world” loads the hello-world image, runs it, and then exits.
docker run hello-world
Even after the program exits, the container resources, including log files, remain so we can examine them later.
List the Docker containers, both running and exited:
docker ps -a
To reattach and rerun a stopped container named myst_kilby.
docker start --attach myst_kilby # Rerun a hello-world container
The command below downloads the Ubuntu image and starts a shell as root in the container. We use the flag “-it” to open an interactive terminal to interact with it.
docker run -it ubuntu bash
To stop a container that is still running:
docker stop fervent_greider # fervent_greider is the container name
Start an Nginx Web server:
docker run -v ~/my_html:/usr/share/nginx/html:ro -p 8080:80 -d nginx
“-v” maps a host directory to a directory in the container. In the example above, we can store the web pages in ~/my_html and the web server container will see them as /usr/share/nginx/html (“:ro” makes the mount read-only). “-d” runs the container as a daemon. “-p 8080:80” lets us access the server on host port 8080, which maps to port 80 in the container.
To remove the container named fervent_greider and clean up the resources.
docker rm fervent_greider
We can use “--rm” to instruct Docker to remove the container after it exits.
docker run --rm hello-world
To remove all stopped containers, along with unused networks and images:
docker system prune -a
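If you end up launching containers from scripts, it helps to assemble the flags above in one place. Here is a small sketch of such a helper. The function and its parameter names are my own invention, and it only covers the flags discussed in this section.

```python
import shlex


def docker_run_args(image, command=(), volumes=(), ports=(), gpus=None,
                    interactive=False, detach=False, remove=False):
    """Build the argument list for a 'docker run' call."""
    args = ["docker", "run"]
    if gpus is not None:
        args += ["--gpus", str(gpus)]
    if interactive:
        args.append("-it")
    if detach:
        args.append("-d")
    if remove:
        args.append("--rm")
    for host_dir, container_dir in volumes:
        args += ["-v", "{}:{}".format(host_dir, container_dir)]
    for host_port, container_port in ports:
        args += ["-p", "{}:{}".format(host_port, container_port)]
    args.append(image)
    args += list(command)
    return args


# Rebuild the Nginx example; /home/me is a placeholder home directory.
nginx = docker_run_args(
    "nginx",
    volumes=[("/home/me/my_html", "/usr/share/nginx/html:ro")],
    ports=[(8080, 80)],
    detach=True)
print(" ".join(shlex.quote(a) for a in nginx))
```

The returned list can be passed straight to subprocess.run, which avoids shell quoting problems.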
The Nvidia Container Toolkit lets us build and run GPU-accelerated Docker containers such as NGC. It is an extension to Docker. The instructions below originate from here.
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
Again, we can test the installation by:
#### Test nvidia-smi with the latest official CUDA image
$ docker run --gpus all nvidia/cuda:9.0-base nvidia-smi
# Start a GPU enabled container on two GPUs
$ docker run --gpus 2 nvidia/cuda:9.0-base nvidia-smi
# Starting a GPU enabled container on specific GPUs
$ docker run --gpus '"device=1,2"' nvidia/cuda:9.0-base nvidia-smi
$ docker run --gpus '"device=UUID-ABCDEF,1"' nvidia/cuda:9.0-base nvidia-smi
# Specifying a capability (graphics, compute, ...) for my container
# Note this is rarely if ever used this way
$ docker run --gpus all,capabilities=utility nvidia/cuda:9.0-base nvidia-smi
The “--gpus” flag indicates which GPUs to use. Here is the output from the 2 GPUs I have.
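The quoting of the device list is easy to get wrong when the command is built from a script. Below is a tiny sketch that produces the value for “--gpus” in the three forms shown above. The function name is made up, and the literal double quotes around the device list simply mirror the commands above.

```python
def gpus_value(selection):
    """Format a GPU selection as the value passed to 'docker run --gpus'."""
    if isinstance(selection, str):
        return selection            # e.g. "all", passed through unchanged
    if isinstance(selection, int):
        return str(selection)       # a count of GPUs, e.g. 2
    devices = ",".join(str(d) for d in selection)
    # Keep the literal double quotes, as in --gpus '"device=1,2"'.
    return '"device={}"'.format(devices)


print(gpus_value("all"))
print(gpus_value(2))
print(gpus_value([1, 2]))
```

When the result is used in an argument list for subprocess.run, no extra shell quoting is needed because the double quotes are already part of the value.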
NGC for TensorFlow
Finally, we are going to pull the NGC Docker image. Here, we can find the NGC for TensorFlow version number in the link and the pull command to pull the Docker image. Run this command in a terminal.
Replace tf1 with tf2 above for the TensorFlow 2.x version. We can test the installation with the command below. Replace tensorflow:19.12-tf2-py3 with your Docker image.
docker run --gpus all -it --rm nvcr.io/nvidia/tensorflow:19.12-tf2-py3
Here is the output:
To run a TensorFlow application, you can
# cd to the directory containing myapp.py
docker run --gpus all -it --rm -v $PWD:/workspace nvcr.io/nvidia/tensorflow:19.12-tf2-py3 python myapp.py
Or simply type the command below for an interactive shell.
docker run --gpus all -it --rm -v $PWD:/workspace nvcr.io/nvidia/tensorflow:19.12-tf2-py3
Then, type any commands you want, including python myapp.py.
Use “--gpus all” to make all GPUs available to the application. However, for some platforms like TensorFlow, without a couple of lines of extra code, the application uses only one of the GPUs. To use a specific GPU, use the “device=” form of the flag shown earlier (for example, --gpus '"device=1"') with the specific ID for the GPU.
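Outside Docker, a different way to pin an application to specific GPUs is the CUDA_VISIBLE_DEVICES environment variable, which CUDA reads at startup. This is not the “--gpus” mechanism, just a commonly used alternative. The sketch below builds such an environment for a child process. The function name is made up.

```python
import os


def env_for_gpus(gpu_ids, base_env=None):
    """Copy an environment and restrict CUDA to the given GPU IDs."""
    env = dict(os.environ if base_env is None else base_env)
    # Inside the child process, the visible GPUs are renumbered from 0.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


print(env_for_gpus([1], {"HOME": "/root"}))
```

A script could then start the application with subprocess.run(["python", "myapp.py"], env=env_for_gpus([1])).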
NGC for PyTorch
There is an NGC Docker image for PyTorch as well.
Again, we can download and test the image with:
docker pull nvcr.io/nvidia/pytorch:19.10-py3
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:19.10-py3
Install Python & pip
While we can run DL applications with Docker, there are other ways to set up your own environment. Let’s prepare our system first.
sudo apt update
sudo apt install python3-dev python3-pip python3-testresources
sudo pip3 install -U virtualenv
The information on the Python version you need is here. We will use Python 3.7 for now. (Personally, I have not had much trouble even with a later Python version.) But let’s keep the changes to a minimum and stay with the suggested one.
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt-get update
sudo apt-get install python3.7
You can check the installation by:
Note that python3 remains 3.6.
We will keep it this way. Changing it to 3.7 breaks gnome-terminal in Ubuntu 18.04. Even though there are simple workarounds, the possible side effects of changing python3 can be problematic. The general suggestion is not to interfere with the system python3 or the system pip. Create a local environment instead. To invoke 3.7, we will use python3.7. (We can also create a symbolic link in /usr/bin (say pythontf). If you want your scripts to be forward compatible, use pythontf instead.)
TensorFlow using Virtual environment
Next, we will create and activate a virtual environment. The first instruction below creates a local environment in ./venv, a separate environment with its own packages stored under “venv”. The second activates it.
virtualenv --system-site-packages -p python3.7 ./venv
source ./venv/bin/activate
Inside the virtual environment, we can upgrade pip without impacting the global one.
pip install --upgrade pip
The final step installs the GPU version of TensorFlow.
pip install --upgrade tensorflow-gpu
Let’s test it out.
python -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
If this works, you have reached the biggest milestone in this article. Congratulations!
To exit the virtual environment:
deactivate
To get back to the virtual environment:
source ./venv/bin/activate
And we can run a TensorFlow application again.
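Since so much depends on whether a virtual environment is active, here is a quick way to check from Python itself. It is a small sketch that covers both the older virtualenv releases (which set sys.real_prefix) and venv or newer virtualenv (where sys.prefix differs from sys.base_prefix). The function name is mine.

```python
import sys


def in_virtual_environment():
    """Return True if this interpreter runs inside a virtual environment."""
    if hasattr(sys, "real_prefix"):
        return True  # set by older virtualenv releases
    # venv and newer virtualenv point sys.prefix at the environment.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("python", sys.version.split()[0])
print("virtual environment active:", in_virtual_environment())
```

Run it with the same python command you are about to use for pip, and you will know which interpreter and environment the packages will land in.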
If you don’t use a virtual environment, make sure you use and upgrade a local pip instead of the system pip. Otherwise, other applications on the system may break because of your pip upgrade.
So if you are NOT running inside a virtual environment (activated, for example, with source ./venv/bin/activate), don’t use the commands below. And if you need sudo pip install, you are likely on the wrong path: it runs as root and changes the system pip.
pip install --upgrade pip
pip install --upgrade tensorflow-gpu
Instead, we should perform
python3.7 -m pip install --upgrade pip
python3.7 -m pip install --upgrade tensorflow-gpu
This runs pip under a local environment with python3.7, and packages will now be installed in:
TensorFlow with Docker
In this section, we will look into running the TensorFlow Docker image instead of NGC. Here, we pull the TensorFlow image. Then we execute an inline program.
docker pull tensorflow/tensorflow
docker run -it --rm tensorflow/tensorflow python -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
Or we can put the program in test.py and run:
docker run -it --rm -v $PWD:/tmp -w /tmp tensorflow/tensorflow python ./test.py
-v maps the host directory to a container directory. -w specifies the working directory as /tmp in the container.
-v $PWD:/tmp -w /tmp
Now, we can save test.py in the current host directory ($PWD). In the container, it becomes /tmp/test.py. Since the working directory is /tmp, we can address test.py as ./test.py in the command line.
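To make the mapping concrete, here is a small sketch that translates a host path into the path the container sees for a given “-v host_dir:mount_point” pair. The function name and the example paths are invented for illustration.

```python
import posixpath


def container_path(host_path, host_dir, mount_point):
    """Translate a host path under host_dir to its path inside the container."""
    rel = posixpath.relpath(host_path, host_dir)
    if rel == ".." or rel.startswith("../"):
        raise ValueError("{} is not under {}".format(host_path, host_dir))
    return posixpath.normpath(posixpath.join(mount_point, rel))


# With -v /home/jhui/proj:/tmp, the host file shows up under /tmp.
print(container_path("/home/jhui/proj/test.py", "/home/jhui/proj", "/tmp"))
```

Files outside the mapped host directory raise an error here, which matches the fact that the container cannot see them.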
To run a Jupyter Notebook:
docker run -it -p 8888:8888 tensorflow/tensorflow:nightly-py3-jupyter
The output includes a URL, like the one below, that you can use to access the notebook.
To run a docker image with GPU:
docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu \
python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
The flag “--gpus all” indicates using all GPUs.
PyTorch with Anaconda
The preferred package manager for PyTorch is Anaconda. Go here and find the link address for its installer.
Copy the link address into the command below, then run the downloaded installer.
curl -O https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh
bash Anaconda3-2019.10-Linux-x86_64.sh
Test the installation
Log out and log in again from the terminal. I prefer that conda’s base environment not be activated on startup, so I run:
conda config --set auto_activate_base false
Go here to configure the command to install PyTorch with conda.
This installs pytorch and other packages into the current environment (which is base now).
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
Let’s run a PyTorch application:
In the second step, store the PyTorch code in test2.py.
(base) jhui@r9:~$ vi test2.py
(base) jhui@r9:~$ python test2.py
Here is the code. Run it with python test2.py.
import torch
x = torch.rand(5, 3)
print(x)
As demonstrated below, the python within the conda environment is 3.7.4 while the Ubuntu environment is 2.7.
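The test above only shows that PyTorch imports. To see whether it can also reach the GPU, something like the sketch below can help. It uses torch.cuda.is_available(), torch.cuda.device_count() and torch.cuda.get_device_name(), and it returns a message instead of failing when torch is not installed. The function name is mine.

```python
import importlib.util


def torch_gpu_report():
    """Describe whether torch is installed and whether it can see a GPU."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch
    if not torch.cuda.is_available():
        return "torch {}: no CUDA device visible".format(torch.__version__)
    return "torch {}: {} CUDA device(s), first is {}".format(
        torch.__version__,
        torch.cuda.device_count(),
        torch.cuda.get_device_name(0))


print(torch_gpu_report())
```

If it reports no CUDA device, check the driver and the cudatoolkit version chosen in the install command.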
Again, let’s quickly summarize some of the most basic conda commands.
conda update conda
Create a new environment called snowflakes and install the package biopython.
conda create --name snowflakes biopython
Let’s activate snowflakes.
conda activate snowflakes
Or switch back to the base:
conda activate base
List all the environments. The active one has a “*” next to it.
conda info --envs
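For scripts, conda can also print the environment list as JSON with conda env list --json. Here is a rough sketch that turns that output into names. It assumes the JSON has an "envs" list of paths and treats any path that is not inside an envs/ directory as the base install. Both the helper name and the sample paths are made up.

```python
import json
import posixpath


def named_envs(json_text):
    """Return (name, path) pairs from the output of 'conda env list --json'."""
    paths = json.loads(json_text).get("envs", [])
    envs = []
    for path in paths:
        parent, name = posixpath.split(path.rstrip("/"))
        if posixpath.basename(parent) != "envs":
            name = "base"  # assumption: a prefix outside envs/ is the base
        envs.append((name, path))
    return envs


# Hypothetical output for an install with one extra environment.
sample = ('{"envs": ["/home/jhui/anaconda3", '
          '"/home/jhui/anaconda3/envs/snowflakes"]}')
for name, path in named_envs(sample):
    print(name, path)
```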
Again, we can use the same tool to get a pip command that installs PyTorch instead of the conda one.
Nevertheless, similar to TensorFlow, use this command only within a virtual environment.
pip3 install torch torchvision
Outside a virtual environment, use a local pip instead:
python3.7 -m pip install --upgrade torch torchvision
Now, let’s connect the GPU machine to GitHub. These instructions are for those with private repositories. If you only access public repositories, you can skip this section. First, we create an RSA public and private key pair. On our GPU machine:
ssh-keygen -t rsa -b 4096 -C "firstname.lastname@example.org"
eval "$(ssh-agent -s)"
Log in to your GitHub account and select settings.
Select SSH and GPG keys and add a new SSH key.
Copy the content in the following file and paste it as the new SSH key.
Back on the GPU machine, we can test it with:
To clone a project:
git clone git@github.com:jhui/project.git
I was very reluctant to write this article. The variety of software and hardware configurations makes universal instructions extremely hard. Different steps lead a machine to different states. Without tracing what you did, it is unlikely that you can solve your problems. Google your error messages extensively. If you cannot find any information that gives you hints within an hour, it is likely that your system state or configuration is very different. You may want to restart from a better, known state first. That being said, because of the complexity, I prefer not to address individual troubleshooting for this article. I know your pain, but I found it too hard to support it this way.
When you are not familiar with the setup process, keep things simple and do not ask for perfection. Refining the process iteratively will get you to the target much faster. If you find a solution that can help people, please leave a note in the responses.
Another challenge is keeping this information updated. For example, the TensorFlow API changes very fast and is often not backward compatible. If you find outdated information, please list the old description and state what the new one should be. That will help me a lot in tracking the changes.
If you know a better way to do things, let me know as well. Keep things simple. I want to apply the 80/20 rule: cover the important points but not every angle. Be nice about other people’s comments. I kick rude people out (but just one person this year 😇).