TensorFlow and PyTorch

TensorFlow is an open-source machine learning framework.

Security issue of TensorBoard on multi-user systems

For security reasons, it is not recommended to run TensorBoard on a multi-user system. TensorBoard does not come with any means of access control, and anyone with access to the multi-user system can attach to your TensorBoard port and act as you! (It might only take some effort to find the port if you do not use the default one.) There is nothing NHR@FAU can do to mitigate these security issues. Even the hint --host localhost from https://github.com/tensorflow/tensorboard/issues/260#issuecomment-471737166 does not help on a multi-user system, and neither does the suggestion from https://github.com/tensorflow/tensorboard/issues/267#issuecomment-671820015.

We patched the preinstalled TensorBoard version on Alex according to https://github.com/tensorflow/tensorboard/pull/5570 so that the use of a hash is enforced.

However, we recommend running TensorBoard on your local machine with the HPC file system mounted (e.g. via sshfs).
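
A minimal sketch of this local workflow, assuming sshfs is installed on your machine; the login host, user name, and remote path below are placeholders and must be replaced with your actual values:

mkdir -p ~/hpc-mount
# mount the remote HPC file system locally via sshfs (host and path are placeholders)
sshfs your_user@login-host.example.de:/path/to/your/work ~/hpc-mount
# run TensorBoard locally against the mounted log directory
tensorboard --logdir ~/hpc-mount/experiments/logs
# unmount when done
fusermount -u ~/hpc-mount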

Availability / Target HPC systems

TensorFlow and PyTorch are currently not installed on any of RRZE's HPC systems, as new versions are released very frequently and all groups have their own special needs.

The following HPC systems are best suited:

  • TinyGPU or Alex
  • Woody for CPU-only runs

Notes

Different routes can be taken to get your private installation of TensorFlow or PyTorch. Do not waste valuable storage in $HOME; use $WORK instead for storing your installation.

# Reminder: make sure your dependencies are loaded and you are running the installation in an interactive job
module avail python
module load python/XY
module load cuda
module load cudnn
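
For the interactive job mentioned in the reminder above, a hedged sketch of a Slurm request is shown below; the partition name, GPU count, and time limit are placeholders and must be adapted to the cluster you use (see the Batch Processing documentation):

# request an interactive allocation (all options are placeholders)
salloc --partition=example_partition --gres=gpu:1 --time=01:00:00
# once the allocation is granted, load the modules listed above and install into $WORK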

Using pre-built Docker images from DockerHub

Official Docker images are regularly published on https://hub.docker.com/r/tensorflow/tensorflow and https://hub.docker.com/r/pytorch/pytorch/. These images can be used with Singularity on our HPC systems. Run the following steps on the woody frontend to pull your image:

cd $WORK
export SINGULARITY_CACHEDIR=$(mktemp -d)
singularity pull tensorflow-2.1.0-gpu-py3.sif docker://tensorflow/tensorflow:2.1.0-gpu-py3
rm -rf $SINGULARITY_CACHEDIR

Within your job script, you use the container as follows. /home/* and /apps/ are automatically bind-mounted into the container. On TinyGPU, GPU device libraries are also automatically bind-mounted into the container.

./tensorflow-2.1.0-gpu-py3.sif  ./script.py
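
For illustration, a minimal Slurm job script sketch embedding this container call is given below; the GPU request, time limit, working directory, and script name are placeholders and must be adapted to the target cluster:

#!/bin/bash -l
#SBATCH --gres=gpu:1            # GPU request (placeholder)
#SBATCH --time=01:00:00         # time limit (placeholder)

cd $WORK/my-project             # placeholder project directory
# the .sif image is executable and runs the given script inside the container
./tensorflow-2.1.0-gpu-py3.sif ./script.py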

Using pre-built Docker images from Nvidia

Nvidia maintains its own Docker images for TensorFlow on the NVIDIA GPU Cloud (NGC) which are updated once per month: https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow. These images can also be used with Singularity on our TinyGPU cluster. Run the following steps on the woody frontend to pull your image:
cd $WORK
export SINGULARITY_CACHEDIR=$(mktemp -d)
singularity pull tensorflow-ngc-20.03-tf2-py3.sif docker://nvcr.io/nvidia/tensorflow:20.03-tf2-py3
rm -rf $SINGULARITY_CACHEDIR

Within your job script, the container is used in the same way as above; /home/* and /apps/ are again automatically bind-mounted into the container, as are the GPU device libraries on TinyGPU.

./tensorflow-ngc-20.03-tf2-py3.sif  script.py

pip / virtual env

When manually installing TensorFlow or PyTorch (into a Python virtual environment) using pip, remember to load a python module first! The system Python will not be sufficient.

A simple pip install tensorflow will not work! You need to install cudatoolkit and cudnn first to get GPU support.

PyTorch provides some help for the pip install command; see https://pytorch.org/get-started/locally/. Select stable, linux, pip, python, and cuda-$version, where $version is the CUDA module version you previously loaded.
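
A hedged sketch of such an installation into a virtual environment under $WORK follows; the module versions and the CUDA wheel index URL are placeholders and must match the modules and CUDA version you actually loaded:

# load a Python module and the CUDA stack (versions are placeholders)
module load python/XY cuda cudnn
# create and activate a virtual environment in $WORK
python -m venv $WORK/venv-torch
source $WORK/venv-torch/bin/activate
pip install --upgrade pip
# install PyTorch wheels matching the loaded CUDA version (index URL is an example)
pip install torch --index-url https://download.pytorch.org/whl/cu118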

conda

Anaconda also comes with TensorFlow packages in conda-forge. Either load one of the python modules and install the additional packages into one of your own directories, or start from scratch with your private (mini)conda installation. The system Python will not be sufficient.

PyTorch provides some help for the conda install command; see https://download.pytorch.org/whl/torch_stable.html. Select stable, linux, conda, python, and cuda-$version, where $version is the CUDA module version you previously loaded.
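
A hedged conda sketch, assuming a private (mini)conda installation; the environment prefix, Python version, and channel below are placeholders:

# create an environment under $WORK and install TensorFlow from conda-forge (versions are placeholders)
conda create --prefix $WORK/conda/tf-env python=3.10
conda activate $WORK/conda/tf-env
conda install -c conda-forge tensorflow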

To check that your TensorFlow installation is functional and detects the hardware, you can use the following simple Python sequence:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
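
Assuming TensorFlow 2.x, an equivalent check can also be run directly from the shell (a hedged alternative to the snippet above):

python -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'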

To check that your PyTorch installation is functional and detects the hardware, you can use the following simple line in your bash shell:

python -c 'import torch; print(torch.rand(2,3).cuda())'
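
A hedged additional check that reports whether CUDA is available at all, without raising an error when no GPU is present:

python -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'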

Further information

  • https://www.tensorflow.org/
  • https://github.com/tensorflow/tensorflow
  • https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow

Mentors

  • please volunteer!
  • Prof. Harald Köstler (NHR/LSS)