TensorFlow and PyTorch

TensorFlow and PyTorch are open-source machine learning frameworks.

Availability / Target HPC systems

TensorFlow and PyTorch are currently not installed on any of RRZE's HPC systems, as new versions are released very frequently and every group has its own specific needs.

The following HPC systems are best suited:

  • TinyGPU, Alex, or the GPU nodes in Emmy for GPU runs
  • Woody (many, but smaller nodes) for CPU-only runs

Notes

Different routes can be taken to get your own private installation of TensorFlow or PyTorch. Do not waste valuable storage in $HOME; store your installation in $WORK instead.

The pre-built Docker images might not work on the GTX980 nodes in TinyGPU as their host CPU is too old to support the required AVX instruction set.
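
Whether a CPU supports AVX can be checked directly on the node in question:

grep -c avx /proc/cpuinfo     # counts logical CPUs with the avx flag; 0 means no AVX support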

Official Docker images are regularly published at https://hub.docker.com/r/tensorflow/tensorflow and https://hub.docker.com/r/pytorch/pytorch/. These images can be used with Singularity on our HPC systems. Run the following steps on the Woody frontend to pull your image:

cd $WORK
export SINGULARITY_CACHEDIR=$(mktemp -d)
singularity pull tensorflow-2.1.0-gpu-py3.sif docker://tensorflow/tensorflow:2.1.0-gpu-py3
rm -rf $SINGULARITY_CACHEDIR
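
As an optional sanity check, you can start Python inside the freshly pulled container on the frontend; no GPU is required for a plain import, and the printed version should match the image tag:

singularity exec tensorflow-2.1.0-gpu-py3.sif python -c 'import tensorflow as tf; print(tf.__version__)'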

Within your job script, use the container as follows. /home/* and /apps/ are automatically bind-mounted into the container. On TinyGPU (but currently not on Emmy), the GPU device libraries are also automatically bind-mounted into the container.

./tensorflow-2.1.0-gpu-py3.sif  ./script.py

On the GPU nodes of Emmy, you have to use singularity run --nv tensorflow-2.1.0-gpu-py3.sif  ./script.py.
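
A minimal job script for TinyGPU might then look like the following sketch; the Slurm directives, resource values, and the working directory are illustrative assumptions, so adapt them to the batch documentation of the respective cluster:

#!/bin/bash -l
#SBATCH --gres=gpu:1          # request one GPU (illustrative Slurm syntax)
#SBATCH --time=01:00:00       # adjust the walltime to your workload

cd $WORK/my-project           # hypothetical directory containing the .sif and script.py
singularity run --nv tensorflow-2.1.0-gpu-py3.sif ./script.py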

Nvidia maintains its own Docker images for TensorFlow on the NVIDIA GPU Cloud (NGC), which are updated once per month: https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow. These images can also be used with Singularity on our TinyGPU cluster. Run the following steps on the Woody frontend to pull your image:

cd $WORK
export SINGULARITY_CACHEDIR=$(mktemp -d)
singularity pull tensorflow-ngc-20.03-tf2-py3.sif docker://nvcr.io/nvidia/tensorflow:20.03-tf2-py3
rm -rf $SINGULARITY_CACHEDIR

Within your job script, use the container as before; the same automatic bind mounts apply (including the GPU device libraries on TinyGPU, but currently not on Emmy).

./tensorflow-ngc-20.03-tf2-py3.sif ./script.py

On the GPU nodes of Emmy, you have to use singularity run --nv tensorflow-ngc-20.03-tf2-py3.sif  ./script.py.

When manually installing TensorFlow or PyTorch (into a Python VirtualEnv) using pip, remember to load a python module first! The system python will not be sufficient.

A simple pip install tensorflow will not work! You need to install cudatoolkit and cudnn first to get GPU support.
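
The following sketch shows such a manual installation into a virtual environment under $WORK; the module names and versions are assumptions, so check module avail for what is actually provided:

module load python cuda cudnn           # assumed module names; check "module avail"
python -m venv $WORK/venvs/tensorflow   # keep the environment in $WORK, not $HOME
source $WORK/venvs/tensorflow/bin/activate
pip install --upgrade pip
pip install tensorflow                  # finds the CUDA/cuDNN libraries from the loaded modules at runtime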

PyTorch provides help for composing the pip install command; see https://pytorch.org/get-started/locally/. Select Stable / Linux / Pip / Python / CUDA $version, where $version is the version of the CUDA module you previously loaded.
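
The selector then yields a command of roughly the following form; the package versions and the cu101 suffix (CUDA 10.1) are illustrative and must match your loaded CUDA module:

pip install torch==1.5.0+cu101 torchvision==0.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html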

Anaconda also provides TensorFlow packages via conda-forge. Either load one of the python modules and install the additional packages into one of your own directories, or start from scratch with your private (mini)conda installation! The system python will not be sufficient.
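
A sketch of the conda route, with the environment deliberately placed in $WORK; the channel and package names are assumptions to be verified with conda search:

conda create -p $WORK/conda/tf python=3.8   # create the environment under $WORK, not $HOME
conda activate $WORK/conda/tf
conda install -c conda-forge tensorflow     # for GPU support, cudatoolkit and cudnn are needed as well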

PyTorch also provides help for composing the conda install command; see https://pytorch.org/get-started/locally/. Select Stable / Linux / Conda / Python / CUDA $version, where $version is the version of the CUDA module you previously loaded.
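
For example, the selector produces a command of the following form; the cudatoolkit version is an assumption and has to match the loaded CUDA module:

conda install pytorch torchvision cudatoolkit=10.1 -c pytorch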

To check that your TensorFlow installation is functional and detects the hardware, you can use the following short Python snippet:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

To check that your PyTorch installation is functional and detects the GPU, you can run the following one-liner in your bash shell:

python -c 'import torch; print(torch.rand(2,3).cuda())'

Further information

Mentors

  • please volunteer!