TensorFlow and PyTorch
TensorFlow is an open-source machine learning framework.
Security issue of TensorBoard on multi-user systems
For security reasons it is not recommended to run TensorBoard on a multi-user system. TensorBoard does not come with any means of access control, and anyone with access to the multi-user system can attach to your TensorBoard port and act as you! (Not using the default port only means it takes a little effort to find it.) There is nothing NHR@FAU can do to mitigate these security issues. Even the hint --host localhost from https://github.com/tensorflow/tensorboard/issues/260#issuecomment-471737166 does not help on a multi-user system, and neither does the suggestion from https://github.com/tensorflow/tensorboard/issues/267#issuecomment-671820015.
We patched the preinstalled TensorBoard version on Alex according to https://github.com/tensorflow/tensorboard/pull/5570 so that using a hash is enforced.
However, we recommend running TensorBoard on your local machine with the HPC file system mounted (e.g. via sshfs).
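As a rough sketch of the sshfs workflow; the user name, host name, and paths below are illustrative and must be adapted to your account and system:

```shell
# mount your remote work directory locally (user/host/paths are examples)
mkdir -p ~/hpc-work
sshfs yourname@alex.nhr.fau.de:/path/to/your/work ~/hpc-work

# run TensorBoard locally against the mounted log directory
tensorboard --logdir ~/hpc-work/experiment/logs

# unmount when finished (Linux; on macOS use: umount ~/hpc-work)
fusermount -u ~/hpc-work
```

This way TensorBoard only ever listens on your own workstation, so the multi-user access-control problem does not arise.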
Availability / Target HPC systems
TensorFlow and PyTorch are currently not installed on any of RRZE's HPC systems, as new versions are released very frequently and all groups have their own special needs.
The following HPC systems are best suited:
- TinyGPU or Alex
- Woody for CPU-only runs
Notes
Different routes can be taken to get your private installation of TensorFlow or PyTorch. Don't waste valuable storage in $HOME; use $WORK instead for storing your installation.
# Reminder: make sure your dependencies are loaded and you are running the installation in an interactive job
module avail python
module load python/XY
module load cuda
module load cudnn
Using pre-built Docker images from DockerHub
Official Docker images are regularly published on https://hub.docker.com/r/tensorflow/tensorflow and https://hub.docker.com/r/pytorch/pytorch/. These images can be used with Singularity on our HPC systems. Run the following steps on the Woody frontend to pull your image:
cd $WORK
export SINGULARITY_CACHEDIR=$(mktemp -d)
singularity pull tensorflow-2.1.0-gpu-py3.sif docker://tensorflow/tensorflow:2.1.0-gpu-py3
rm -rf $SINGULARITY_CACHEDIR
Within your job script, you use the container as follows. /home/* and /apps/ are automatically bind-mounted into the container. On TinyGPU, GPU device libraries are also automatically bind-mounted into the container.
./tensorflow-2.1.0-gpu-py3.sif ./script.py
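For illustration, a minimal Slurm job script around this call could look like the following sketch; the resource requests, time limit, and script name are assumptions that must be adapted to the target cluster and your workload:

```shell
#!/bin/bash -l
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
#SBATCH --job-name=tf-container

cd "$WORK"
# the pulled .sif image is executable; /home/* and /apps/ are bind-mounted automatically
./tensorflow-2.1.0-gpu-py3.sif ./script.py
```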
Using pre-built Docker images from Nvidia
Nvidia publishes optimized TensorFlow images in its NGC catalog (https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow). Pull them the same way on the Woody frontend:
cd $WORK
export SINGULARITY_CACHEDIR=$(mktemp -d)
singularity pull tensorflow-ngc-20.03-tf2-py3.sif docker://nvcr.io/nvidia/tensorflow:20.03-tf2-py3
rm -rf $SINGULARITY_CACHEDIR
Within your job script, you use the container as follows. /home/* and /apps/ are automatically bind-mounted into the container. On TinyGPU, GPU device libraries are also automatically bind-mounted into the container.
./tensorflow-ngc-20.03-tf2-py3.sif script.py
pip / virtual env
When manually installing TensorFlow or PyTorch (into a Python virtual environment) using pip, remember to load a python module first! The system python will not be sufficient.
A simple pip install tensorflow will not work! You need to install cudatoolkit and cudnn first to get GPU support.
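A possible installation sequence might look like the following sketch; the module names and versions are examples and depend on what is available on the system:

```shell
# load a python module plus the CUDA stack first (versions are examples)
module load python/3.9 cuda cudnn

# create the virtual environment in $WORK, not in $HOME
python -m venv "$WORK/venvs/tf-env"
source "$WORK/venvs/tf-env/bin/activate"

pip install --upgrade pip
pip install tensorflow
```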
PyTorch provides some help for the pip install command; see https://pytorch.org/get-started/locally/. Select stable, linux, pip, python, cuda-$version, where $version is the CUDA module version you previously loaded from modules.
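As an illustration, the selector on that page typically produces a pip command of roughly this shape; the exact index URL and the CUDA suffix (cu118 here) are assumptions and must match the CUDA module you loaded:

```shell
# install PyTorch wheels built against a specific CUDA version (cu118 is an example)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
```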
conda
TensorFlow packages are also available for conda via the conda-forge channel. Either load one of the python modules and install the additional packages into one of your directories, or start from scratch with your private (mini)conda installation! The system python will not be sufficient.
To convince the login node (which does not have a GPU) to install the GPU version of TensorFlow using conda, it may be necessary to temporarily set the environment variable CONDA_OVERRIDE_CUDA; cf. https://conda-forge.org/blog/posts/2021-11-03-tensorflow-gpu/.
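Following that conda-forge blog post, the override can be sketched like this; the CUDA and TensorFlow version numbers are examples only:

```shell
# pretend to conda that a CUDA 11.2 driver is present on the GPU-less login node
CONDA_OVERRIDE_CUDA="11.2" conda install -c conda-forge "tensorflow==2.7.*=cuda*"
```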
PyTorch provides some help for the conda install command; see https://pytorch.org/get-started/locally/. Select stable, linux, conda, python, cuda-$version, where $version is the CUDA module version you previously loaded from modules.
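For illustration, a conda command from the PyTorch selector typically looks like the following; the package set and the CUDA version are examples to be matched against your loaded CUDA module:

```shell
# install PyTorch with CUDA support from the pytorch and nvidia channels (versions are examples)
conda install pytorch torchvision pytorch-cuda=11.8 -c pytorch -c nvidia
```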
To check that your TensorFlow is functional and detects the hardware, you can use the following simple python sequence:
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
To check that your PyTorch is functional and detects the hardware, you can use the following one-liner in your shell:
python -c 'import torch; print(torch.rand(2,3).cuda())'
Further information
- https://www.tensorflow.org/
- https://github.com/tensorflow/tensorflow
- https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow
Mentors
- please volunteer!
- Prof. Harald Köstler (NHR/LSS)