TensorFlow and PyTorch

TensorFlow and PyTorch are open-source machine learning frameworks.

Security issue of TensorBoard on multi-user systems

Running TensorBoard on a multi-user system is not recommended for security reasons. TensorBoard does not come with any means of access control, so anyone with access to the multi-user system can attach to your TensorBoard port and act as you! (If you do not use the default port, finding the port might only take some effort.) There is nothing NHR@FAU can do to mitigate these security issues. Even the hint --host localhost in https://github.com/tensorflow/tensorboard/issues/260#issuecomment-471737166 does not help on a multi-user system. The suggestion from https://github.com/tensorflow/tensorboard/issues/267#issuecomment-671820015 does not help on a multi-user system either.

We patched the preinstalled TensorBoard version on Alex according to https://github.com/tensorflow/tensorboard/pull/5570 so that the use of a hash is enforced.

However, we recommend running TensorBoard on your local machine with the HPC file system mounted (e.g. via sshfs).

Availability / Target HPC systems

TensorFlow and PyTorch are currently not installed on any of RRZE’s HPC systems, because new versions are released very frequently and all groups have their own special needs.

The following HPC systems are best suited:

  • TinyGPU or Alex
  • Woody for CPU-only runs

Notes

Different routes can be taken to get your private installation of TensorFlow or PyTorch. Do not waste valuable storage in $HOME; use $WORK instead for storing your installation.

# Reminder: make sure your dependencies are loaded and that you run the installation in an interactive job
module avail python
module load python/XY
module load cuda
module load cudnn
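
To double-check that the Python environment you ended up with really lives under $WORK and not under $HOME, a small script along the following lines may help. This is only a sketch: the helper names is_under and describe_install_location are hypothetical, and it assumes that $WORK and $HOME are set in your shell as on our systems.

```python
import os
import sys
from pathlib import Path


def is_under(path, parent):
    """Return True if `path` is `parent` itself or is located below it."""
    path = Path(path).resolve()
    parent = Path(parent).resolve()
    return path == parent or parent in path.parents


def describe_install_location(prefix=sys.prefix, environ=os.environ):
    """Classify where the Python environment at `prefix` is stored.

    Returns "WORK", "HOME", or "other". Both function names are
    hypothetical helpers for this sketch, not part of any module.
    """
    work = environ.get("WORK")
    home = environ.get("HOME")
    # Check $WORK first so that a $WORK located below $HOME
    # is still reported as "WORK".
    if work and is_under(prefix, work):
        return "WORK"
    if home and is_under(prefix, home):
        return "HOME"
    return "other"


if __name__ == "__main__":
    # sys.prefix points to the active virtual environment or conda env.
    print(f"{sys.prefix} -> {describe_install_location()}")
```

Run it with the Python interpreter of the environment you want to inspect. If it reports HOME, consider recreating the environment below $WORK.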

Official Docker images are regularly published on https://hub.docker.com/r/tensorflow/tensorflow and https://hub.docker.com/r/pytorch/pytorch/. These images can be used with Singularity on our HPC systems. Run the following steps on the Woody frontend to pull your image:

cd $WORK
export SINGULARITY_CACHEDIR=$(mktemp -d)
singularity pull tensorflow-2.1.0-gpu-py3.sif docker://tensorflow/tensorflow:2.1.0-gpu-py3
rm -rf $SINGULARITY_CACHEDIR

Within your job script, you use the container as follows. /home/* and /apps/ are automatically bind-mounted into the container. On TinyGPU, GPU device libraries are also automatically bind-mounted into the container.

./tensorflow-2.1.0-gpu-py3.sif ./script.py

NVIDIA maintains its own Docker images for TensorFlow on the NVIDIA GPU Cloud (NGC), which are updated once per month: https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow. These images can also be used with Singularity on our TinyGPU cluster. Run the following steps on the Woody frontend to pull your image:

cd $WORK
export SINGULARITY_CACHEDIR=$(mktemp -d)
singularity pull tensorflow-ngc-20.03-tf2-py3.sif docker://nvcr.io/nvidia/tensorflow:20.03-tf2-py3
rm -rf $SINGULARITY_CACHEDIR

Within your job script, you use the container as follows. /home/* and /apps/ are automatically bind-mounted into the container. On TinyGPU, GPU device libraries are also automatically bind-mounted into the container.

./tensorflow-ngc-20.03-tf2-py3.sif script.py

When manually installing TensorFlow or PyTorch (into a Python virtual environment) using pip, remember to load a Python module first! The system Python is not sufficient.

A simple pip install tensorflow will not work! You need to install cudatoolkit and cudnn first to get GPU support.

PyTorch provides some help for the pip install command; see https://pytorch.org/get-started/locally/. Select Stable, Linux, Pip, Python, and CUDA $version, where $version is the version of the CUDA module you loaded previously.
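
As a rough illustration of what the selector produces, the sketch below builds a pip command from a CUDA module version. It assumes the wheel index naming https://download.pytorch.org/whl/cu<major><minor> that the selector currently emits; the function name pytorch_pip_command is hypothetical, and not every CUDA version has matching wheels, so treat the selector on pytorch.org as authoritative.

```python
def pytorch_pip_command(cuda_version, packages=("torch",)):
    """Build a pip install command for a CUDA version such as "11.8".

    `cuda_version` must contain at least "major.minor"; any further
    components (e.g. "12.1.1") are ignored.
    """
    major, minor = cuda_version.split(".")[:2]
    tag = f"cu{major}{minor}"  # e.g. "11.8" becomes "cu118"
    # Assumption: PyTorch publishes CUDA-specific wheels under this URL scheme.
    index_url = f"https://download.pytorch.org/whl/{tag}"
    return ["python", "-m", "pip", "install", *packages, "--index-url", index_url]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(" ".join(pytorch_pip_command("11.8")))
```

Printing the command rather than executing it keeps the sketch harmless on a login node; run the printed line yourself inside your activated virtual environment.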

TensorFlow packages are also available for conda from the conda-forge channel. Either load one of the Python modules and install the additional packages into one of your directories, or start from scratch with your private (mini)conda installation! The system Python is not sufficient.

To convince the login node (which does not have a GPU) to install the GPU version of TensorFlow using conda, it may be necessary to temporarily set the environment variable CONDA_OVERRIDE_CUDA, cf. https://conda-forge.org/blog/posts/2021-11-03-tensorflow-gpu/
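
A minimal sketch of what "temporarily" can look like: the variable is set only in the environment of the conda child process, not in your login shell. The function name conda_install_with_cuda_override and the value "11.2" are assumptions for illustration; use the CUDA version that matches the module you load in your jobs.

```python
import os
import subprocess


def conda_install_with_cuda_override(cuda_version, packages, dry_run=True):
    """Prepare (and optionally run) conda install with CONDA_OVERRIDE_CUDA set.

    The override lives only in the copied environment passed to the
    child process; os.environ of the caller is left untouched.
    """
    env = dict(os.environ, CONDA_OVERRIDE_CUDA=cuda_version)
    cmd = ["conda", "install", "--yes", "-c", "conda-forge", *packages]
    if not dry_run:
        # Requires conda on PATH and an activated target environment.
        subprocess.run(cmd, env=env, check=True)
    return cmd, env


if __name__ == "__main__":
    # "11.2" is only an example value, not a recommendation.
    cmd, env = conda_install_with_cuda_override("11.2", ["tensorflow"])
    print(f"CONDA_OVERRIDE_CUDA={env['CONDA_OVERRIDE_CUDA']}", " ".join(cmd))
```

With dry_run=True the function only returns the command and environment, so you can inspect them before installing anything.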

PyTorch provides some help for the conda install command; see https://pytorch.org/get-started/locally/. Select Stable, Linux, Conda, Python, and CUDA $version, where $version is the version of the CUDA module you loaded previously.

To check that your TensorFlow installation is functional and detects the hardware, you can use the following Python one-liner:

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
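
If the list printed by the one-liner is empty, a slightly more verbose check can help to tell a CPU-only build apart from a missing GPU. The following is a sketch, not an official diagnostic; the function name tensorflow_gpu_report is hypothetical, and it assumes a TensorFlow 2.x installation.

```python
import importlib.util


def tensorflow_gpu_report():
    """Collect basic facts about the TensorFlow installation, if any."""
    if importlib.util.find_spec("tensorflow") is None:
        # TensorFlow is not importable from this Python environment.
        return {"installed": False, "version": None,
                "built_with_cuda": None, "gpus": []}
    import tensorflow as tf
    gpus = tf.config.list_physical_devices("GPU")
    return {
        "installed": True,
        "version": tf.__version__,
        # False indicates a CPU-only build, independent of the hardware.
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "gpus": [gpu.name for gpu in gpus],
    }


if __name__ == "__main__":
    for key, value in tensorflow_gpu_report().items():
        print(f"{key}: {value}")
```

Run it inside a job on a GPU node; on a login node without a GPU, the list of GPUs is expected to be empty even for a correct installation.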

To check that your PyTorch installation is functional and detects the hardware, you can use the following one-liner in your shell:

python -c 'import torch; print(torch.rand(2,3).cuda())'
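
The one-liner fails with an exception when no GPU is usable. If you prefer a report over an exception, a sketch like the following may be useful; the function name pytorch_gpu_report is hypothetical.

```python
import importlib.util


def pytorch_gpu_report():
    """Collect basic facts about the PyTorch installation, if any."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not importable from this Python environment.
        return {"installed": False, "version": None,
                "cuda_available": False, "gpus": []}
    import torch
    available = torch.cuda.is_available()
    # Only query devices when CUDA is usable.
    count = torch.cuda.device_count() if available else 0
    return {
        "installed": True,
        "version": torch.__version__,
        "cuda_available": available,
        "gpus": [torch.cuda.get_device_name(i) for i in range(count)],
    }


if __name__ == "__main__":
    for key, value in pytorch_gpu_report().items():
        print(f"{key}: {value}")
```

As with TensorFlow, run the check inside a job on a GPU node, since the login nodes do not have a GPU.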

Further information

Mentors

  • please volunteer!
  • Prof. Harald Köstler (NHR/LSS)