FAQ

Categories: Usage | Software | Hardware | General information

Usage

Acknowledgement

How to acknowledge resource usage

In general, use the following formulation in publications for acknowledging the resources and the support of NHR@FAU:

  • for the FAU Tier3 resources: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). The hardware is funded by the German Research Foundation (DFG).”
  • for the NHR@FAU resources/projects: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) under the NHR project <ID of your NHR@FAU project>. NHR funding is provided by federal and Bavarian state authorities. NHR@FAU hardware is partially funded by the German Research Foundation (DFG) – 440719683.”

(Also do not forget to send a copy of your papers to nhr-redaktion@lists.fau.de!)

Alex Cluster

How can I access the new clusters Alex and Fritz?

FAU staff and students who already have an HPC account can request access to Alex here: https://hpc.fau.de/tier3-access-to-alex/ and access to Fritz here: https://hpc.fau.de/tier3-access-to-fritz/. Access is restricted to projects with extended demands, i.e. projects that are not feasible on TinyGPU or Woody/Meggie but are still below the NHR thresholds. You have to demonstrate this and provide a short description of what you want to do there.

If you do not have an HPC account, please follow our instructions on “Getting started with HPC”.

External scientists have to submit an NHR proposal to get access.

How can I request an interactive job on Alex?

Interactive jobs can be requested by using salloc and specifying the respective options on the command line.

The following will give you an interactive shell on one of the A40 nodes for one hour:
salloc --gres=gpu:a40:1 --partition=a40 --time=01:00:00

Note that settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

This and more information can be found in our documentation on Alex.

How can I run my job on the cluster?

To submit a job to one of our clusters, you first have to log in to a cluster frontend. The compute nodes are not directly accessible; a batch system handles the queuing of jobs into different partitions (depending on the needed resources, e.g. runtime) and sorts them according to a priority scheme. A job will run when the required resources become available.

Please do not run your jobs on the cluster frontends!

The login nodes are not suitable for computational work, since they are shared among all users. We do not allow MPI-parallel applications on the frontends and short parallel test runs must be performed using batch jobs. It is possible to submit interactive batch jobs that, when started, open a shell on one of the assigned compute nodes and let you run interactive programs there.

Please consult our documentation for details about Batch Processing.

We also provide general job script examples for parallel jobs and GPU jobs; however, we have also prepared more specific job scripts for applications that our users frequently run on the clusters.

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar + optional compression) and use node-local storage that is accessible via $TMPDIR.
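The archive approach can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the three demo files stand in for a large dataset, all paths are made up for the example, and /tmp is only used as a fallback when $TMPDIR is not set (i.e. outside a job).

```shell
#!/bin/bash
# Sketch: bundle many small files into one archive, then unpack it on
# node-local storage inside the job. All paths are hypothetical demo paths.

# Demo data standing in for a large dataset (three tiny files here).
src="$(mktemp -d)"
for i in 1 2 3; do echo "sample $i" > "$src/file_$i.txt"; done

# Step 1 (once, before submitting): pack the directory into a single archive.
archive="$(mktemp -d)/dataset.tar.gz"
tar -czf "$archive" -C "$src" .

# Step 2 (inside the job): unpack onto node-local storage.
# $TMPDIR is set by the batch system; /tmp is only a fallback for this demo.
local_dir="$(mktemp -d "${TMPDIR:-/tmp}/dataset.XXXXXX")"
tar -xzf "$archive" -C "$local_dir"

# The application can now read the files from node-local storage.
n_files="$(find "$local_dir" -type f | wc -l | tr -d ' ')"
echo "unpacked $n_files files into $local_dir"
```

The shared file servers then only see one large sequential read of the archive instead of millions of small file accesses.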

The software I need is not installed. What can I do now?

On all HPC systems, established tools for software development (compilers, editors, …), libraries, and selected applications are available. For many of these applications, it is necessary to set special environment variables, so that e.g. search paths are correct or license servers can be found.

To ease selection of and switching between different versions of software packages, all HPC systems use the modules system (cf. modules.sourceforge.net). It allows you to conveniently load the necessary configurations for different programs or different versions of the same program and, if necessary, unload them again later.

For information on how to use the modules system, please have a look at the respective section in our documentation about “Software environment”.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack” as an enhanced HPC package manager.

Feel free to compile software yourself in the versions and with the options you need. This is perfectly fine, but we cannot provide support for self-installed software. We can only provide software centrally if it is of importance for multiple groups. If you want to use Spack for compiling additional software, you can load our user-spack module; it lets you reuse the packages we have already built with Spack (if the concretization matches) instead of starting from scratch. Once user-spack is loaded, the command spack will be available (as an alias) and you will inherit the presets we defined for certain packages (e.g. Open MPI to work with Slurm), but you will install everything into your own directories ($WORK/USER-SPACK).

You can also bring your own environment in a container using Singularity. However, building Singularity containers on the HPC systems themselves is not supported (as that would require root access). The Infiniband drivers from the host are not mounted into your container. All filesystems will also be available by default in the container. In certain use cases it might be a good idea to avoid bind-mounting your normal $HOME directory with all its “dot directories” into the container by explicitly specifying a different directory, e.g. -H $HOME/my-container-home.

Why is my code not using the GPU on Alex?

CUDA is not installed as part of the OS – you have to load a cuda module for your binaries to find libcublas, libcudnn, etc.
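One way to check whether a binary fails to find its CUDA libraries is to ask the dynamic linker. The helper below is only a sketch: the function name missing_libs and the binary name in the usage comment are made up for illustration, and it assumes the standard Linux tool ldd is available.

```shell
#!/bin/bash
# Sketch: list shared libraries that the dynamic linker cannot resolve.
missing_libs() {
    # ldd prints "libfoo.so => not found" for each unresolved library;
    # awk keeps only the library name from those lines.
    ldd "$1" 2>/dev/null | awk '/not found/ { print $1 }'
}

# Usage (hypothetical binary name):
#   missing_libs ./my_gpu_app
# If libcublas or libcudnn show up in the output, load a cuda module
# and run the check again.
```

An empty result means that all shared libraries were resolved in the current environment.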

Please also have a look at our documentation “Working with NVIDIA GPUs“.

Basic HPC Knowledge

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar + optional compression) and use node-local storage that is accessible via $TMPDIR.

What is a parallel file system?

In a parallel file system (PFS), data is distributed not only across several disks but also across multiple servers in order to increase the data access bandwidth. Most PFSs are connected to the high-speed network of a cluster, and aggregated bandwidths in the TByte/s range are not uncommon. High bandwidth can, however, only be obtained with large files and streaming access.

For information on how to use a parallel file system on our clusters, please read our documentation on “Parallel file systems $FASTTMP”.

What is SMT (also known as hyperthreading)?

Simultaneous multi-threading (SMT) allows a CPU core to run more than one software thread at the same time. These “hardware threads” a.k.a. “virtual cores” share almost all resources. The purpose of this feature is to make better use of the execution units within the core. It is rather hard to predict the benefit of SMT for real applications, so the best strategy is to try it using a well-designed, realistic benchmark case.

What is thread or process affinity?

Modern multicore systems have a strong topology, i.e., groups of hardware threads share different resources such as cores, caches, and memory interfaces. Many performance features of parallel programs depend on where their threads and processes are running in the machine. This makes it vital to bind these threads and processes to hardware threads so that variability is reduced and resources are balanced.
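For OpenMP programs, binding can be requested through the standard OpenMP environment variables. The following is a minimal sketch; the thread count of 4 and the program name in the comment are placeholders, and the best settings depend on the application and the node topology.

```shell
#!/bin/bash
# Sketch: pin OpenMP threads using the standard OpenMP environment variables.
export OMP_NUM_THREADS=4      # hypothetical thread count for this example
export OMP_PLACES=cores       # one place per physical core
export OMP_PROC_BIND=close    # keep threads on neighbouring places

# For MPI programs under Slurm, srun can bind tasks as well, e.g.:
#   srun --cpu-bind=cores ./my_app     # ./my_app is a placeholder
echo "threads=$OMP_NUM_THREADS places=$OMP_PLACES bind=$OMP_PROC_BIND"
```

Using OMP_PROC_BIND=spread instead distributes the threads as evenly as possible over the places, which can help memory-bound codes.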

Why does my program give an HTTP/HTTPS timeout?

When software running on one of the cluster nodes tries to connect to the internet, you might encounter time-out errors.
By default, we do not allow cluster nodes to access the internet.
However, you can work around this by setting a proxy:
export http_proxy=http://proxy:80
export https_proxy=http://proxy:80
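If you prefer not to change the environment of the whole shell, the proxy can also be set for a single command. This is only a sketch; the helper name with_proxy is made up for this example, and it reuses the proxy address from the export lines above.

```shell
#!/bin/bash
# Sketch: set the proxy for one command only instead of the whole shell.
with_proxy() {
    # env runs the given command with the two variables added to its environment.
    env http_proxy="http://proxy:80" https_proxy="http://proxy:80" "$@"
}

# Usage: with_proxy <your command and its arguments>
with_proxy sh -c 'echo "https_proxy inside the command: $https_proxy"'
```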

Why should I care about file systems?

Not only may efficient file operations speed up your own code (if file I/O is what you must do); they will also reduce the burden on shared file servers and thus leave more performance headroom for other users of the resource. Hence, it is a matter of thoughtfulness to optimize file accesses even if your performance gain is marginal.

Batch System

How can I request an interactive job on Alex?

Interactive jobs can be requested by using salloc and specifying the respective options on the command line.

The following will give you an interactive shell on one of the A40 nodes for one hour:
salloc --gres=gpu:a40:1 --partition=a40 --time=01:00:00

Note that settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

This and more information can be found in our documentation on Alex.

How can I request an interactive job on Fritz?

Interactive jobs can be requested by using salloc and specifying the respective options on the command line.

The following will give you an interactive shell on one node for one hour:
salloc -N 1 --partition=singlenode --time=01:00:00

The following will give you four nodes with an interactive shell on the first node for one hour:
salloc -N 4 --partition=multinode --time=01:00:00

Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

This and more information can be found in our documentation on Fritz.

How can I request an interactive job on TinyGPU?

Interactive Slurm Shell (RTX2080Ti, RTX3080, V100 and A100 nodes only)

To generate an interactive Slurm shell on one of the compute nodes, the following command has to be issued on the woody frontend:
salloc.tinygpu --gres=gpu:1 --time=00:30:00

This will give you an interactive shell for 30 minutes on one of the nodes, allocating 1 GPU and the respective number of CPU cores. There, you can then for example compile your code or do test runs of your binary. For MPI-parallel binaries, use srun instead of mpirun.

Please note that salloc automatically exports the environment of your shell on the login node to your interactive job. This can cause problems if you have loaded any modules due to the version differences between the woody frontend and the TinyGPU compute nodes. To mitigate this, purge all loaded modules via module purge before issuing the salloc command.

This and more information can be found in our documentation about TinyGPU.

How can I request an interactive job on Woody-NG?

Interactive jobs can be requested by using salloc and specifying the respective options on the command line.

The following will give you an interactive shell on one node with one core dedicated to you for one hour:
salloc -n 1 --time=01:00:00

Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

This and more information can be found in our documentation about Woody-NG.

How can I run my job on the cluster?

To submit a job to one of our clusters, you first have to log in to a cluster frontend. The compute nodes are not directly accessible; a batch system handles the queuing of jobs into different partitions (depending on the needed resources, e.g. runtime) and sorts them according to a priority scheme. A job will run when the required resources become available.

Please do not run your jobs on the cluster frontends!

The login nodes are not suitable for computational work, since they are shared among all users. We do not allow MPI-parallel applications on the frontends and short parallel test runs must be performed using batch jobs. It is possible to submit interactive batch jobs that, when started, open a shell on one of the assigned compute nodes and let you run interactive programs there.

Please consult our documentation for details about Batch Processing.

We also provide general job script examples for parallel jobs and GPU jobs; however, we have also prepared more specific job scripts for applications that our users frequently run on the clusters.
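For orientation, a very small Slurm job script could look like the sketch below. It is not one of the official examples: the job name is made up, the partition singlenode is the Fritz partition used in the interactive example in this FAQ, and the actual program call is left as a placeholder. The #SBATCH lines are ordinary comments for the shell, so the script can also be run by hand as a quick syntax check.

```shell
#!/bin/bash -l
#SBATCH --job-name=demo
#SBATCH --partition=singlenode
#SBATCH --nodes=1
#SBATCH --time=00:10:00

# Slurm sets SLURM_JOB_ID inside a job; the fallback only matters when the
# script is run by hand for a syntax check.
job_id="${SLURM_JOB_ID:-not-in-a-job}"
echo "job $job_id running on $(uname -n)"

# Replace this placeholder with the real program, e.g. "srun ./my_app".
echo "work would start here"
```

Such a script is submitted from a cluster frontend with sbatch followed by the name of the script file.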

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar + optional compression) and use node-local storage that is accessible via $TMPDIR.

CUDA

Why is my code not running on the GPU in TinyGPU?

Do not try to build a GPU-enabled TensorFlow, PyTorch, … on the Woody login nodes. That will fail as the Woody login nodes do not have NVIDIA software installed (moreover, most TinyGPU nodes run a newer OS version).

Please have a look at our documentation about “Working with NVIDIA GPUs”.

Why is my code not using the GPU on Alex?

CUDA is not installed as part of the OS – you have to load a cuda module for your binaries to find libcublas, libcudnn, etc.

Please also have a look at our documentation “Working with NVIDIA GPUs“.

Cluster Access

How can I access the cluster frontends?

Almost all HPC systems at NHR@FAU use private IP addresses that can only be accessed directly from within FAU. There are three options for logging in to the clusters from outside the university:

  1. Use a VPN (Virtual Private Network) connection.
  2. Use IPv6. The cluster frontends have world-visible IPv6 addresses.
  3. Use our “dialog server” cshpc.rrze.fau.de. The dialog server is the only HPC machine with a public IPv4 address. cshpc is a Linux system that is open to all HPC accounts. From this machine, you can log into any NHR@FAU system. A more complete description can be found on our documentation pages.

Whichever option you choose, you need to use SSH. Please consult our extensive SSH documentation pages for details.
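For the third option, OpenSSH can hop through the dialog server in one step using its -J (jump host) option. The helper below only prints the resulting command so that you can review it first; the function name hpc_ssh_cmd, the user name, and the frontend host name are placeholders for illustration, not real account or host names.

```shell
#!/bin/bash
# Sketch: print an ssh command that hops through the dialog server.
hpc_ssh_cmd() {
    local user="$1" frontend="$2"
    # -J tells ssh to connect to the first host and tunnel on to the second.
    echo "ssh -J ${user}@cshpc.rrze.fau.de ${user}@${frontend}"
}

# Both arguments are placeholders; use your HPC account and a real frontend.
hpc_ssh_cmd myuser frontend.example
```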

 

How can I access the new clusters Alex and Fritz?

FAU staff and students who already have an HPC account can request access to Alex here: https://hpc.fau.de/tier3-access-to-alex/ and access to Fritz here: https://hpc.fau.de/tier3-access-to-fritz/. Access is restricted to projects with extended demands, i.e. projects that are not feasible on TinyGPU or Woody/Meggie but are still below the NHR thresholds. You have to demonstrate this and provide a short description of what you want to do there.

If you do not have an HPC account, please follow our instructions on “Getting started with HPC”.

External scientists have to submit an NHR proposal to get access.

How can I get access to HPC systems?

Getting an HPC account

Depending on your status, there are different procedures for getting an HPC account:

  • NHR users from outside FAU: see the page on NHR application rules for up-to-date information on allocating resources of NHR@FAU.
    Also check the pages on the NHR@FAU HPC-Portal Usage / New digital workflow for HPC accounts.
  • FAU staff and students (except for lectures): use the HPC application form. Details on how to fill in the form are given below. Basic usage of the HPC systems is typically free of charge for FAU researchers for publicly funded research. For compute needs beyond the free basic usage, see the page on NHR application rules for preliminary information on allocating resources of NHR@FAU.
  • Lectures of FAU with need for HPC access: there is a simplified procedure for getting HPC accounts for all students of your course. Lecturers have to approach HPC support with a list of the IdM accounts of all students of the course and the course name. Late registrations of additional students are not possible. Thus, be sure to collect all IdM accounts before sending the list to RRZE.
  • Block courses with external participants: lecturers have to approach HPC support at least one week in advance with the title and date of the course and the expected number of participants. Such accounts cannot be valid for more than one week.

The HPC application form for FAU within HPC4FAU

You can get the application form here: HPC application form. Applications always have to be approved by your local chair / institute – we do not serve private persons. If you have any questions regarding the application, please contact your local IT contact person at the chair / institute.

You need to fill out the application form, print it, sign it, and have it stamped with the seal of your chair or institute.

Once it is ready, you can bring it to the RRZE Service Desk or send it by email or internal mail.

Please visit our documentation about Getting started with HPC for more information.

How can I run my job on the cluster?

To submit a job to one of our clusters, you first have to log in to a cluster frontend. The compute nodes are not directly accessible; a batch system handles the queuing of jobs into different partitions (depending on the needed resources, e.g. runtime) and sorts them according to a priority scheme. A job will run when the required resources become available.

Please do not run your jobs on the cluster frontends!

The login nodes are not suitable for computational work, since they are shared among all users. We do not allow MPI-parallel applications on the frontends and short parallel test runs must be performed using batch jobs. It is possible to submit interactive batch jobs that, when started, open a shell on one of the assigned compute nodes and let you run interactive programs there.

Please consult our documentation for details about Batch Processing.

We also provide general job script examples for parallel jobs and GPU jobs; however, we have also prepared more specific job scripts for applications that our users frequently run on the clusters.

How to acknowledge resource usage

In general, use the following formulation in publications for acknowledging the resources and the support of NHR@FAU:

  • for the FAU Tier3 resources: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). The hardware is funded by the German Research Foundation (DFG).”
  • for the NHR@FAU resources/projects: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) under the NHR project <ID of your NHR@FAU project>. NHR funding is provided by federal and Bavarian state authorities. NHR@FAU hardware is partially funded by the German Research Foundation (DFG) – 440719683.”

(Also do not forget to send a copy of your papers to nhr-redaktion@lists.fau.de!)

The software I need is not installed. What can I do now?

On all HPC systems, established tools for software development (compilers, editors, …), libraries, and selected applications are available. For many of these applications, it is necessary to set special environment variables, so that e.g. search paths are correct or license servers can be found.

To ease selection of and switching between different versions of software packages, all HPC systems use the modules system (cf. modules.sourceforge.net). It allows you to conveniently load the necessary configurations for different programs or different versions of the same program and, if necessary, unload them again later.

For information on how to use the modules system, please have a look at the respective section in our documentation about “Software environment”.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack” as an enhanced HPC package manager.

Feel free to compile software yourself in the versions and with the options you need. This is perfectly fine, but we cannot provide support for self-installed software. We can only provide software centrally if it is of importance for multiple groups. If you want to use Spack for compiling additional software, you can load our user-spack module; it lets you reuse the packages we have already built with Spack (if the concretization matches) instead of starting from scratch. Once user-spack is loaded, the command spack will be available (as an alias) and you will inherit the presets we defined for certain packages (e.g. Open MPI to work with Slurm), but you will install everything into your own directories ($WORK/USER-SPACK).

You can also bring your own environment in a container using Singularity. However, building Singularity containers on the HPC systems themselves is not supported (as that would require root access). The Infiniband drivers from the host are not mounted into your container. All filesystems will also be available by default in the container. In certain use cases it might be a good idea to avoid bind-mounting your normal $HOME directory with all its “dot directories” into the container by explicitly specifying a different directory, e.g. -H $HOME/my-container-home.

FileSystems/Data Storage

How can I leverage node-local storage on TinyGPU to increase job performance?

Each node has at least 880 GB of local SSD capacity for temporary files under $TMPDIR.

The directory $TMPDIR will be deleted automatically as soon as the user has no jobs running on the node any more.

Data to be kept can be copied to a cluster-wide volume at the end of the job.
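The pattern of working in $TMPDIR and copying results back at the end can be sketched as follows. All names are illustrative: stage_out is a made-up helper, the target directory is a throwaway demo location (in a real job it would be below $WORK or $HPCVAULT), and /tmp is only a fallback for running the sketch outside a job.

```shell
#!/bin/bash
# Sketch: work in node-local storage and copy results back before the job ends.
stage_out() {
    # Copy everything from the scratch directory $1 to the target directory $2.
    mkdir -p "$2" && cp -r "$1"/. "$2"/
}

# $TMPDIR is provided inside a job; /tmp is only a fallback for this demo.
scratch="$(mktemp -d "${TMPDIR:-/tmp}/run.XXXXXX")"
# Hypothetical target; in a real job this would be below $WORK or $HPCVAULT.
target="$(mktemp -d)/results"

echo "result data" > "$scratch/output.dat"   # placeholder for the real computation
stage_out "$scratch" "$target"
ls "$target"
```

Keep in mind that the copy step needs time as well, so the requested job runtime should leave room for it.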

Please also read our documentation on “File Systems“.

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar + optional compression) and use node-local storage that is accessible via $TMPDIR.

What is a parallel file system?

In a parallel file system (PFS), data is distributed not only across several disks but also across multiple servers in order to increase the data access bandwidth. Most PFSs are connected to the high-speed network of a cluster, and aggregated bandwidths in the TByte/s range are not uncommon. High bandwidth can, however, only be obtained with large files and streaming access.

For information on how to use a parallel file system on our clusters, please read our documentation on “Parallel file systems $FASTTMP”.

Where can I store my data?

Your home directory is accessible via $HOME. Each user gets a standard quota of 50 Gigabytes and quota extensions are not possible.

Additional storage is accessible via $HPCVAULT. Here, the default quota for each user is 500 Gigabytes.

The recommended work directory is accessible via $WORK. The standard quota for each user is 500 Gigabytes.

All three directories ($HOME, $HPCVAULT and $WORK) are available throughout our HPC systems.

We recommend that you use the aforementioned variables in your job scripts and do not rely on the specific paths, as these may change over time, e.g. when directories are relocated to a different NFS server.
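To see where these variables currently point, you can simply print them. The small helper below is a sketch; the function name show_storage_vars is made up, and variables that are not exported in the current environment (for example $TMPDIR outside a job) are reported as unset.

```shell
#!/bin/bash
# Sketch: print the storage-related variables instead of hard-coding paths.
show_storage_vars() {
    for name in HOME HPCVAULT WORK TMPDIR FASTTMP; do
        # printenv prints nothing and fails if the variable is not exported.
        printf '%s=%s\n' "$name" "$(printenv "$name" || echo '<unset>')"
    done
}
show_storage_vars
```

In a job script, refer to "$WORK/..." or "$HPCVAULT/..." directly rather than to the paths printed here.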

Job-specific storage (located either in main memory [RAM disk] or, if available, on a local HDD / SSD) is accessible via $TMPDIR and is always node-local. Its size differs between clusters, and it is only available during the job's lifetime. Data is flushed after the job finishes!

Some of our clusters have a local parallel filesystem for high performance short-term storage that is accessible via $FASTTMP. These filesystems are specific to the clusters and not available on other clusters. This type of storage is not suitable for programs such as MD simulations that have quite high output rates!

Please also have a look into our documentation on “File Systems“.

Why should I care about file systems?

Not only may efficient file operations speed up your own code (if file I/O is what you must do); they will also reduce the burden on shared file servers and thus leave more performance headroom for other users of the resource. Hence, it is a matter of thoughtfulness to optimize file accesses even if your performance gain is marginal.

Why the need for several file systems?

Different file systems have different features; for example, a central NFS server offers a lot of capacity for the money but limited data bandwidth, while a parallel file system is much faster but smaller and usually available to one cluster only. A node-local SSD, on the other hand, has the advantage of very low latency, but it cannot be accessed from outside its compute node.

For further information, please consult our documentation on “File systems“.

Fritz Cluster

How can I access the new clusters Alex and Fritz?

FAU staff and students who already have an HPC account can request access to Alex here: https://hpc.fau.de/tier3-access-to-alex/ and access to Fritz here: https://hpc.fau.de/tier3-access-to-fritz/. Access is restricted to projects with extended demands, i.e. projects that are not feasible on TinyGPU or Woody/Meggie but are still below the NHR thresholds. You have to demonstrate this and provide a short description of what you want to do there.

If you do not have an HPC account, please follow our instructions on “Getting started with HPC”.

External scientists have to submit an NHR proposal to get access.

How can I request an interactive job on Fritz?

Interactive jobs can be requested by using salloc and specifying the respective options on the command line.

The following will give you an interactive shell on one node for one hour:
salloc -N 1 --partition=singlenode --time=01:00:00

The following will give you four nodes with an interactive shell on the first node for one hour:
salloc -N 4 --partition=multinode --time=01:00:00

Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

This and more information can be found in our documentation on Fritz.

How can I run my job on the cluster?

To submit a job to one of our clusters, you first have to log in to a cluster frontend. The compute nodes are not directly accessible; a batch system handles the queuing of jobs into different partitions (depending on the needed resources, e.g. runtime) and sorts them according to a priority scheme. A job will run when the required resources become available.

Please do not run your jobs on the cluster frontends!

The login nodes are not suitable for computational work, since they are shared among all users. We do not allow MPI-parallel applications on the frontends and short parallel test runs must be performed using batch jobs. It is possible to submit interactive batch jobs that, when started, open a shell on one of the assigned compute nodes and let you run interactive programs there.

Please consult our documentation for details about Batch Processing.

We also provide general job script examples for parallel jobs and GPU jobs; however, we have also prepared more specific job scripts for applications that our users frequently run on the clusters.

The software I need is not installed. What can I do now?

On all HPC systems, established tools for software development (compilers, editors, …), libraries, and selected applications are available. For many of these applications, it is necessary to set special environment variables, so that e.g. search paths are correct or license servers can be found.

To ease selection of and switching between different versions of software packages, all HPC systems use the modules system (cf. modules.sourceforge.net). It allows you to conveniently load the necessary configurations for different programs or different versions of the same program and, if necessary, unload them again later.

For information on how to use the modules system, please have a look at the respective section in our documentation about “Software environment”.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack” as an enhanced HPC package manager.

Feel free to compile software yourself in the versions and with the options you need. This is perfectly fine, but we cannot provide support for self-installed software. We can only provide software centrally if it is of importance for multiple groups. If you want to use Spack for compiling additional software, you can load our user-spack module; it lets you reuse the packages we have already built with Spack (if the concretization matches) instead of starting from scratch. Once user-spack is loaded, the command spack will be available (as an alias) and you will inherit the presets we defined for certain packages (e.g. Open MPI to work with Slurm), but you will install everything into your own directories ($WORK/USER-SPACK).

You can also bring your own environment in a container using Singularity. However, building Singularity containers on the HPC systems themselves is not supported (as that would require root access). The Infiniband drivers from the host are not mounted into your container. All filesystems will also be available by default in the container. In certain use cases it might be a good idea to avoid bind-mounting your normal $HOME directory with all its “dot directories” into the container by explicitly specifying a different directory, e.g. -H $HOME/my-container-home.

GPU usage

How can I run my job on the cluster?

To submit a job to one of our clusters, you first have to log in to a cluster frontend. The compute nodes are not directly accessible; a batch system handles the queuing of jobs into different partitions (depending on the needed resources, e.g. runtime) and sorts them according to a priority scheme. A job will run when the required resources become available.

Please do not run your jobs on the cluster frontends!

The login nodes are not suitable for computational work, since they are shared among all users. We do not allow MPI-parallel applications on the frontends and short parallel test runs must be performed using batch jobs. It is possible to submit interactive batch jobs that, when started, open a shell on one of the assigned compute nodes and let you run interactive programs there.

Please consult our documentation for details about Batch Processing.

We also provide general job script examples for parallel jobs and GPU jobs; however, we have also prepared more specific job scripts for applications that our users frequently run on the clusters.

Why is my code not running on the GPU in TinyGPU?

Do not try to build a GPU-enabled TensorFlow, PyTorch, … on the Woody login nodes. That will fail as the Woody login nodes do not have NVIDIA software installed (moreover, most TinyGPU nodes run a newer OS version).

Please have a look at our documentation about “Working with NVIDIA GPUs”.

Why is my code not using the GPU on Alex?

CUDA is not installed as part of the OS – you have to load a cuda module for your binaries to find libcublas, libcudnn, etc.

Please also have a look at our documentation “Working with NVIDIA GPUs“.

Login

How can I access the cluster frontends?

Almost all HPC systems at NHR@FAU use private IP addresses that can only be accessed directly from within FAU. There are three options for logging in to the clusters from outside the university:

  1. Use a VPN (Virtual Private Network) connection.
  2. Use IPv6. The cluster frontends have world-visible IPv6 addresses.
  3. Use our “dialog server” cshpc.rrze.fau.de. The dialog server is the only HPC machine with a public IPv4 address. cshpc is a Linux system that is open to all HPC accounts. From this machine, you can log into any NHR@FAU system. A more complete description can be found on our documentation pages.

Whichever option you choose, you need to use SSH. Please consult our extensive SSH documentation pages for details.

 

Miscellaneous

How to acknowledge resource usage

In general, use the following formulation in publications for acknowledging the resources and the support of NHR@FAU:

  • for the FAU Tier3 resources: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). The hardware is funded by the German Research Foundation (DFG).”
  • for the NHR@FAU resources/projects: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) under the NHR project <ID of your NHR@FAU project>. NHR funding is provided by federal and Bavarian state authorities. NHR@FAU hardware is partially funded by the German Research Foundation (DFG) – 440719683.”

(Also do not forget to send a copy of your papers to nhr-redaktion@lists.fau.de!)

Password

SSH is asking for a password

If you got your HPC account through the new HPC portal (https://portal.hpc.fau.de), for example because it belongs to an NHR project, there is NO password for this HPC account. You log in to the HPC portal using your SSO credentials (of your university). HPC accounts created through the portal can access the HPC systems with SSH keys only. The SSH public key is uploaded to the HPC portal; it takes a couple of hours until all HPC systems know a new or changed SSH public key. Multiple SSH public keys can be uploaded.

What is the password of my HPC account?

A) If you got your HPC account through the new HPC portal (https://portal.hpc.fau.de), for example because it belongs to an NHR project, there is NO password for this HPC account. You log in to the HPC portal using your SSO credentials (of your university). HPC accounts created through the portal can access the HPC systems with SSH keys only. The SSH public key is uploaded to the HPC portal; it takes a couple of hours until all HPC systems know a new or changed SSH public key. Multiple SSH public keys can be uploaded.

B) If you have an FAU IdM account and applied for your HPC account with the HPC paper form, you can set a dedicated password for the HPC account through the IdM portal (https://idm.fau.de). It will take a couple of hours until all HPC systems know the changed password.

SSH

Debugging SSH problems

To get more information on SSH problems, add the “-v” option to the ssh command. This prints moderately detailed debug information, e.g. which SSH keys are tried.

Here is a sample output:

max@notebook:~$ ssh -v cshpc
OpenSSH_7.6p1 Ubuntu-4ubuntu0.7, OpenSSL 1.0.2n 7 Dec 2017
debug1: Reading configuration data /home/max/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: Connecting to cshpc.rrze.uni-erlangen.de [131.188.3.39] port 22.
debug1: Connection established.
debug1: identity file /home/max/.ssh/id_rsa type 0
debug1: key_load_public: No such file or directory
debug1: identity file /home/max/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: Local version string SSH-2.0-OpenSSH_7.6p1 Ubuntu-4ubuntu0.7
debug1: Remote protocol version 2.0, remote software version OpenSSH_8.2p1 Ubuntu-4ubuntu0.5
debug1: match: OpenSSH_8.2p1 Ubuntu-4ubuntu0.5 pat OpenSSH* compat 0x04000000
debug1: Authenticating to cshpc.rrze.uni-erlangen.de:22 as 'unrz143'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ecdsa-sha2-nistp256 SHA256:wFaDywle3yJvygQ4ZAPDsi/iSBTaF6Uoo0i0z727aJU
debug1: Host 'cshpc.rrze.uni-erlangen.de' is known and matches the ECDSA host key.
debug1: Found key in /home/max/.ssh/known_hosts:215
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<ssh-ed25519,sk-ssh-ed25519@openssh.com,ssh-rsa,rsa-sha2-256,rsa-sha2-512,ssh-dss,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,sk-ecdsa-sha2-nistp256@openssh.com>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,password
debug1: Next authentication method: publickey
debug1: Offering public key: RSA SHA256:mWO4eYar1/JYn8MDB0DPer+ibB/QatmhxvvngfaoMgQ /home/max/.ssh/id_rsa
debug1: Server accepts key: pkalg rsa-sha2-512 blen 277
debug1: Authentication succeeded (publickey).
Authenticated to cshpc.rrze.uni-erlangen.de ([131.188.3.39]:22).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: pledge: network
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
debug1: Sending environment.
debug1: Sending env LC_ALL = en_US.UTF-8
debug1: Sending env LANG = en_US.UTF-8

To check the fingerprint of your SSH key, use

max@notebook:~$ ssh-keygen -l -f ~/.ssh/id_rsa
2048 SHA256:mWO4eYar1/JYn8MDB0DPer+ibB/QatmhxvvngfaoMgQ max@notebook (RSA)

This fingerprint must also match the data shown in the HPC portal (if your SSH keys are managed by the HPC portal).


In the debug output I find the following

debug1: Authentications that can continue: publickey,password
debug1: Next authentication method: publickey
debug1: Offering public key: /home/max/.ssh/id_rsa RSA SHA256:xCyJUQcsJldPWfZXSasoI0ZCoteKWHw1e95ylm2HK1g agent
debug1: Server accepts key: /home/max/.ssh/id_rsa RSA SHA256:xCyJUQcsJldPWfZXSasoI0ZCoteKWHw1e95ylm2HK1g agent
sign_and_send_pubkey: signing failed for RSA "/home/max/.ssh/id_rsa" from agent: agent refused operation
debug1: Next authentication method: password

This message is misleading. “sign_and_send_pubkey: signing failed … agent refused operation” typically means that you entered a wrong passphrase for the SSH key.

How can I attach to a running Slurm job

To attach to a running Slurm job, use srun --pty --jobid YOUR-JOBID bash. This will give you a shell on the first node of your job and you can run top, nvidia-smi, etc. to check your job.

This is an alternative to SSH-ing into your node.

Using srun to attach to a job is the only way to see the correct GPUs if you have multiple GPU jobs running on a single node, as SSH will always place you in the most recently modified cgroup, which might not belong to the job / GPUs you are looking for.

I managed to log in to cshpc (with an SSH key) but get asked for a password when continuing to a cluster frontend

The explanation is rather simple: the dialog server cshpc does not know any of your SSH private keys. It therefore cannot do SSH key-based authentication when connecting to one of the cluster frontends and tries password authentication as a fallback.

There are a couple of solutions to mitigate that:

  1. Use the “jump host”/“proxy jump” feature of SSH and directly connect to the cluster frontends through the dialog server cshpc. To do this, either use the command line option “-J” (upper case) of recent SSH versions or use an ~/.ssh/config file on your local computer. See https://hpc.fau.de/systems-services/documentation-instructions/ssh-secure-shell-access-to-hpc-systems/#ssh_config_hpc_portal for templates.
  2. Create an additional SSH key pair on cshpc and add the corresponding SSH public key to the HPC portal (if your account is already managed through the HPC portal), or add it to ~/.ssh/authorized_keys (which will only be a temporary solution until all HPC accounts are migrated to the HPC portal).
  3. Use an SSH agent on your local computer and allow it to forward its connection to our dialog server cshpc.

All three ways make sure that an SSH private key is available when connecting to the cluster frontends.
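
The first option can be sketched as follows. The account name and the frontend host name are placeholders (assumptions for this example, not real values); only the dialog server name is taken from the text above. The script merely prints the ssh command and a matching ~/.ssh/config fragment, so nothing is written to disk and no connection is opened. The templates linked above remain the authoritative version.

```shell
#!/bin/bash
# Placeholders: replace with your own HPC account and the frontend you need.
hpc_user="YOUR_HPC_ACCOUNT"
frontend="FRONTEND_HOSTNAME"
# Dialog server as named in this FAQ.
jump_host="cshpc.rrze.fau.de"

# One-off connection through the jump host using the -J option of OpenSSH.
ssh_cmd=(ssh -J "${hpc_user}@${jump_host}" "${hpc_user}@${frontend}")
printf '%s\n' "${ssh_cmd[*]}"

# Equivalent entries for ~/.ssh/config on your local computer.
ssh_config="$(cat <<EOF
Host cshpc
    HostName ${jump_host}
    User ${hpc_user}

Host ${frontend}
    User ${hpc_user}
    ProxyJump cshpc
EOF
)"
printf '%s\n' "$ssh_config"
```

With such config entries in place, "ssh FRONTEND_HOSTNAME" would be routed through cshpc automatically, and the private key never has to leave your local computer.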

My HPC account just has been created but I cannot login or Slurm rejects my jobs

Home directories and entries in the Slurm database are only created once per day (in the late morning). Thus, please be patient and wait until 10 o’clock the next day if your HPC account has just been (re)created.

As with SSH key updates (and password updates for legacy accounts), the processes run on different servers at different times. Thus, before the next day, some of your directories or services may have already been created while others have not.

My just updated SSH key (from the HPC portal) or password (from the IdM portal) is not accepted

It always takes a couple of hours for updated SSH keys to be propagated to all HPC systems. As the clusters are synchronized at different points in time, it may happen that one system already knows the update while others don’t. It typically takes 2-4 hours for an update to be propagated to all systems.

The same is true for the propagation of HPC passwords for accounts created through the FAU IdM portal using paper applications.

Slurm

Slurm options get ignored when given as sbatch command line arguments

I give some Slurm options as command line arguments to sbatch, but they are ignored!?

The syntax of sbatch is: sbatch [OPTIONS(0)...] [ : [OPTIONS(N)...]] script(0) [args(0)...]

Thus, options for sbatch have to be given before the batch script. Arguments given after the batch script are used as arguments for the batch script and not for sbatch.
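
The effect of the argument order can be made visible with a small sketch. The script name job.sh, its argument input.dat and the option values are hypothetical; the commands are only assembled and printed, not executed.

```shell
#!/bin/bash
# Options meant for sbatch itself (example values).
sbatch_opts=(--time=01:00:00 --job-name=demo)
# Hypothetical batch script and an argument meant for the script.
job_script="job.sh"
script_args=(input.dat)

# Correct: options come before the script, so sbatch interprets them.
right=(sbatch "${sbatch_opts[@]}" "$job_script" "${script_args[@]}")
# Wrong: everything after the script name is handed to job.sh as $1, $2, ...
wrong=(sbatch "$job_script" "${sbatch_opts[@]}" "${script_args[@]}")

echo "sbatch sees the options:     ${right[*]}"
echo "job.sh receives the options: ${wrong[*]}"
```

In the second form, job.sh would find "--time=01:00:00" in its first positional parameter, and the job would run with the default time limit.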

Software environment

Error “module: command not found”

If the module command cannot be found that usually means that you did not invoke the bash shell with the option “-l” (lower case L).

Thus, job scripts, etc. should always start with

#!/bin/bash -l
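
If it is unclear whether a script really runs in a login shell, a quick check like the following sketch may help. It relies on the bash builtin "shopt -q login_shell" and on "type" to look up the module command; the variable names are made up for this example.

```shell
#!/bin/bash -l
# Was this bash started as a login shell (option -l)?
if shopt -q login_shell; then
    shell_kind="login"
else
    shell_kind="non-login"
fi

# Is the module command (function, alias or binary) known to this shell?
if type module >/dev/null 2>&1; then
    module_state="available"
else
    module_state="missing"
fi

echo "shell: ${shell_kind}, module command: ${module_state}"
```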

The software I need is not installed. What can I do now?

On all HPC systems, established tools for software development (compilers, editors, …), libraries, and selected applications are available. For many of these applications, it is necessary to set special environment variables, so that e.g. search paths are correct or license servers can be found.

To ease selection of and switching between different versions of software packages, all HPC systems use the modules system (cf. modules.sourceforge.net). It allows you to conveniently load the necessary configurations for different programs or different versions of the same program and, if necessary, unload them again later.

For information on how to use the modules system, please have a look into the respective section in our documentation about “Software environment“.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack“ as enhanced HPC package manager.

Feel free to compile software yourself in the versions and with the options you need. This is perfectly fine, yet support for self-installed software cannot be guaranteed. We can only provide software centrally if it is of importance for multiple groups. If you want to use Spack for compiling additional software, you can load our user-spack module to reuse the packages we have already built with Spack (if the concretization matches) instead of starting from scratch. Once user-spack is loaded, the command spack will be available (as an alias) and you will inherit the presets we defined for certain packages (e.g. Open MPI to work with Slurm), but you will install everything into your own directories ($WORK/USER-SPACK).

You can also bring your own environment in a container using Singularity. However, building Singularity containers on the HPC systems themselves is not supported (as that would require root access). The InfiniBand drivers from the host are not mounted into your container. All file systems will also be available by default in the container. In certain use cases it might be a good idea to avoid bind-mounting your normal $HOME directory with all its “dot directories” into the container by explicitly specifying a different directory, e.g. -H $HOME/my-container-home.

Why does my program give a http/https timeout?

When software running on one of the cluster nodes tries to connect to the internet, you might encounter timeout errors.
By default, we do not allow cluster nodes to access the internet.
However, you can work around this by setting a proxy:
export http_proxy=http://proxy:80
export https_proxy=http://proxy:80

TinyGPU Cluster

How can I leverage node-local storage on TinyGPU to increase job performance?

Each node has at least 880 GB of local SSD capacity for temporary files under $TMPDIR.

The directory $TMPDIR will be deleted automatically as soon as the user has no jobs running on the node any more.

Data to be kept can be copied to a cluster-wide volume at the end of the job.

Please also read our documentation on “File Systems“.
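
A minimal sketch of this pattern inside a job script could look as follows. The file names and the "computation" are stand-ins, and RESULTS_DIR is a hypothetical variable introduced for this example; in a real job the destination would be a directory on a cluster-wide volume such as $WORK. Outside of a job, the sketch falls back to /tmp.

```shell
#!/bin/bash -l
# Private scratch directory on the node-local SSD.
scratch="$(mktemp -d "${TMPDIR:-/tmp}/job.XXXXXX")"
# Destination for data to be kept (hypothetical variable, see lead-in).
results_dir="${RESULTS_DIR:-$(mktemp -d)}"

# Stand-in for the real computation: write temporary data and a result.
echo "intermediate data" > "$scratch/tmp.dat"
wc -c < "$scratch/tmp.dat" > "$scratch/result.txt"

# At the end of the job: copy what has to be kept, then clean up.
cp "$scratch/result.txt" "$results_dir/"
rm -rf "$scratch"
echo "results saved to $results_dir"
```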

How can I request an interactive job on TinyGPU?

Interactive Slurm Shell (RTX2080Ti, RTX3080, V100 and A100 nodes only)

To generate an interactive Slurm shell on one of the compute nodes, the following command has to be issued on the woody frontend:
salloc.tinygpu --gres=gpu:1 --time=00:30:00

This will give you an interactive shell for 30 minutes on one of the nodes, allocating 1 GPU and the respective number of CPU cores. There, you can then for example compile your code or do test runs of your binary. For MPI-parallel binaries, use srun instead of mpirun.

Please note that salloc automatically exports the environment of your shell on the login node to your interactive job. This can cause problems if you have loaded any modules, due to the version differences between the woody frontend and the TinyGPU compute nodes. To mitigate this, purge all loaded modules via module purge before issuing the salloc command.

This and more information can be found in our documentation about TinyGPU.

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar + optional compression) and use node-local storage that is accessible via $TMPDIR.
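
The archive approach can be sketched like this. The directory and file names are made up, and five tiny demo files stand in for the real data set; the point is that the shared file system only ever sees one archive, while the many small files exist solely on the node-local storage.

```shell
#!/bin/bash -l
# Stand-in for a directory on a shared file system.
work_dir="$(mktemp -d)"
# Node-local storage; falls back to /tmp outside of a job.
scratch="$(mktemp -d "${TMPDIR:-/tmp}/unpack.XXXXXX")"

# Demo data set: a handful of small files instead of two million.
mkdir "$work_dir/dataset"
for i in 1 2 3 4 5; do
    echo "sample $i" > "$work_dir/dataset/file_$i.txt"
done

# Done once, before the jobs: pack everything into a single archive.
tar -czf "$work_dir/dataset.tar.gz" -C "$work_dir" dataset

# Done in every job: unpack onto node-local storage and work there.
tar -xzf "$work_dir/dataset.tar.gz" -C "$scratch"
file_count="$(find "$scratch/dataset" -type f | wc -l | tr -d ' ')"
echo "unpacked ${file_count} files to node-local storage"

# Clean up the demo directories.
rm -rf "$work_dir" "$scratch"
```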

Woody Cluster

How can I request an interactive job on Woody-NG?

Interactive jobs can be requested by using salloc and specifying the respective options on the command line.

The following will give you an interactive shell on one node with one core dedicated to you for one hour:
salloc -n 1 --time=01:00:00

Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

This and more information can be found in our documentation about Woody-NG.

How to acknowledge resource usage
In general, use the following formulation in publications for acknowledging the resources and the support of NHR@FAU: for the FAU Tier3 resources: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). The hardware is funded by the German Research Foundation (DFG).” for the NHR@FAU resources/projects: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) under the NHR project <ID of your NHR@FAU project>. NHR funding is provided by federal and Bavarian state authorities. NHR@FAU hardware is partially funded by the German Research Foundation (DFG) – 440719683.” (Also do not forget to send a copy of your papers to nhr-redaktion@lists.fau.de!)
How can I access the new clusters Alex and Fritz?
FAU staff and students who already have an HPC account can request access to Alex here: https://hpc.fau.de/tier3-access-to-alex/ and access to Fritz here: https://hpc.fau.de/tier3-access-to-fritz/. Access is restricted to  projects with extended demands, thus, not feasible on TinyGPU or Woody/Meggie, but still below the NHR thresholds. You have to prove that and provide a short description of what you want to do there. If you do not have an HPC account, please follow our instructions on “Getting started with HPC“. External scientists have to submit a NHR proposal to get access.
How can I request an interactive job on Alex?
Interactive jobs can be requested by using salloc and specifying the respective options on the command line. The following will give you an interactive shell on one of the A40 nodes for one hour: salloc --gres=gpu:a40:1 --partition=a40 --time=01:00:00 Note that settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job! This and more information can be found in our documentation on Alex.
How can I run my job on the cluster?
To submit a job to one of our cluster, you first have to login to a cluster frontend. The compute nodes are not directly accessible and we have a batch system running that handles the queuing of jobs into different partitions (depending on the needed resources, e.g. runtime) and sorting according to some priority scheme. A job will run when the required resources become available. Please do not run your jobs on the cluster frontends! The login nodes are not suitable for computational work, since they are shared among all users. We do not allow MPI-parallel applications on the frontends and short parallel test runs must be performed using batch jobs. It is possible to submit interactive batch jobs that, when started, open a shell on one of the assigned compute nodes and let you run interactive programs there. Please consult our documentation for details about Batch Processing. We also provide general job script examples for parallel jobs and GPU jobs; however, we have also prepared more specific job scripts for applications that our users frequently run on the clusters.
I have to analyze over 2 million files in my job. What can I do?
Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available. If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar + optional compression) and use node-local storage that is accessible via $TMPDIR.
The software I need is not installed. What can I do now?
On all HPC systems, established tools for software development (compilers, editors, …), libraries, and selected applications are available. Many of these applications need special environment variables to be set, e.g. so that search paths are correct or license servers can be found. To ease selecting and switching between different versions of software packages, all HPC systems use the modules system (cf. modules.sourceforge.net). It lets you conveniently load the necessary configuration for different programs, or for different versions of the same program, and unload it again later if necessary. For information on how to use the modules system, please have a look at the respective section in our documentation about “Software environment”. Even more packages become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack” as an enhanced HPC package manager.

Feel free to compile software yourself in the versions and with the options you need. This is perfectly fine, yet we cannot guarantee support for self-installed software. We can only provide software centrally if it is of importance for multiple groups.

If you want to use Spack for compiling additional software, you can load our user-spack module. It lets you reuse the packages we have already built with Spack, provided the concretization matches, instead of starting from scratch. Once user-spack is loaded, the command spack is available (as an alias) and you inherit the presets we defined for certain packages (e.g. Open MPI to work with Slurm), but you install everything into your own directories ($WORK/USER-SPACK).

You can also bring your own environment in a container using Singularity. However, building Singularity containers on the HPC systems themselves is not supported (as that would require root access). The InfiniBand drivers from the host are not mounted into your container. All file systems are also available in the container by default. In certain use cases it might be a good idea to avoid bind-mounting your normal $HOME directory with all its “dot directories” into the container by explicitly specifying a different directory, e.g. -H $HOME/my-container-home.
Why is my code not using the GPU on Alex?
CUDA is not installed as part of the OS – you have to load a cuda module for your binaries to find libcublas, libcudnn, etc. Please also have a look at our documentation “Working with NVIDIA GPUs“.
What is a parallel file system?
In a parallel file system (PFS), data is distributed not only across several disks but also across multiple servers in order to increase the data access bandwidth. Most PFSs are connected to the high-speed network of a cluster, and aggregated bandwidths in the TByte/s range are not uncommon. High bandwidth can, however, only be obtained with large files and streaming access. For information on how to use a parallel file system on our clusters, please read our documentation on “Parallel file systems $FASTTMP”.
What is SMT (also known as hyperthreading)?
Simultaneous multi-threading (SMT) allows a CPU core to run more than one software thread at the same time. These “hardware threads” a.k.a. “virtual cores” share almost all resources. The purpose of this feature is to make better use of the execution units within the core. It is rather hard to predict the benefit of SMT for real applications, so the best strategy is to try it using a well-designed, realistic benchmark case.
What is thread or process affinity?
Modern multicore systems have a strong topology, i.e., groups of hardware threads share different resources such as cores, caches, and memory interfaces. Many performance features of parallel programs depend on where their threads and processes are running in the machine. This makes it vital to bind these threads and processes to hardware threads so that variability is reduced and resources are balanced.
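A minimal sketch of how such binding is commonly requested for an OpenMP program, assuming the application uses an OpenMP runtime that honors the standard OMP_PLACES and OMP_PROC_BIND variables. The thread count of 4 and the binary name ./my_app are placeholders, and the tools and settings recommended for a particular cluster may differ.

```shell
#!/bin/bash
# Standard OpenMP environment variables (OpenMP 4.0 and later).
export OMP_NUM_THREADS=4     # placeholder thread count
export OMP_PLACES=cores      # one place per physical core
export OMP_PROC_BIND=close   # put threads on places close to the parent thread

echo "threads=$OMP_NUM_THREADS places=$OMP_PLACES bind=$OMP_PROC_BIND"

# The hypothetical binary would then be started with these settings, e.g.:
#   ./my_app
# Where taskset (util-linux) is available, it shows the CPUs the current shell may use:
#   taskset -cp $$
```

Running the same benchmark with and without these settings is a simple way to see how much run-to-run variability the binding removes.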
Why does my program give a http/https timeout?
When running software that tries to connect to the internet on one of the cluster nodes, you might encounter time-out errors. By default, we do not allow cluster nodes to access the internet. However, you can work around this by setting a proxy:

export http_proxy=http://proxy:80
export https_proxy=http://proxy:80
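If the proxy should apply to a single command only rather than to the whole shell session, the variables can be set just for that command. The following sketch wraps this in a small helper; the function name with_proxy is made up for illustration, while the proxy address is the one from the answer above.

```shell
#!/bin/bash
# Hypothetical helper: run one command with the proxy variables set,
# leaving the environment of the calling shell untouched.
with_proxy() {
    http_proxy=http://proxy:80 https_proxy=http://proxy:80 "$@"
}

# Demo: the wrapped command sees the proxy settings.
with_proxy sh -c 'echo "inside: $https_proxy"'

# Outside the wrapper, the variable keeps whatever value it had before.
echo "outside: ${https_proxy:-unset}"
```

Any command that needs internet access can be prefixed with with_proxy in the same way.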
Why should I care about file systems?
Not only may efficient file operations speed up your own code (if file I/O is what you must do); they will also reduce the burden on shared file servers and thus leave more performance headroom for other users of the resource. Hence, it is a matter of thoughtfulness to optimize file accesses even if your performance gain is marginal.
How can I request an interactive job on Alex?
Interactive jobs can be requested by using salloc and specifying the respective options on the command line.

The following will give you an interactive shell on one of the A40 nodes for one hour:

salloc --gres=gpu:a40:1 --partition=a40 --time=01:00:00

Note that settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

This and more information can be found in our documentation on Alex.
How can I request an interactive job on Fritz?
Interactive jobs can be requested by using salloc and specifying the respective options on the command line.

The following will give you an interactive shell on one node for one hour:

salloc -N 1 --partition=singlenode --time=01:00:00

The following will give you four nodes with an interactive shell on the first node for one hour:

salloc -N 4 --partition=multinode --time=01:00:00

Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

This and more information can be found in our documentation on Fritz.
How can I request an interactive job on TinyGPU?
Interactive Slurm Shell (RTX2080Ti, RTX3080, V100 and A100 nodes only)

To get an interactive Slurm shell on one of the compute nodes, issue the following command on the woody frontend:

salloc.tinygpu --gres=gpu:1 --time=00:30:00

This will give you an interactive shell for 30 minutes on one of the nodes, allocating 1 GPU and the respective number of CPU cores. There, you can for example compile your code or do test runs of your binary. For MPI-parallel binaries, use srun instead of mpirun.

Please note that salloc automatically exports the environment of your shell on the login node to your interactive job. If you have loaded any modules, this can cause problems due to the version differences between the woody frontend and the TinyGPU compute nodes. To mitigate this, purge all loaded modules via module purge before issuing the salloc command.

This and more information can be found in our documentation about TinyGPU.
How can I request an interactive job on Woody-NG?
Interactive jobs can be requested by using salloc and specifying the respective options on the command line.

The following will give you an interactive shell on one node with one core dedicated to you for one hour:

salloc -n 1 --time=01:00:00

Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

This and more information can be found in our documentation about Woody-NG.
How can I run my job on the cluster?
To submit a job to one of our clusters, you first have to log in to a cluster frontend. The compute nodes are not directly accessible. A batch system handles the queuing of jobs into different partitions (depending on the needed resources, e.g. runtime) and sorts them according to a priority scheme. A job will run when the required resources become available.

Please do not run your jobs on the cluster frontends! The login nodes are not suitable for computational work, since they are shared among all users. We do not allow MPI-parallel applications on the frontends; even short parallel test runs must be performed using batch jobs.

It is possible to submit interactive batch jobs that, when started, open a shell on one of the assigned compute nodes and let you run interactive programs there.

Please consult our documentation for details about Batch Processing. We provide general job script examples for parallel jobs and GPU jobs, as well as more specific job scripts for applications that our users frequently run on the clusters.
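For orientation, a minimal sketch of what a batch job script for Slurm can look like. The job name, the resource requests, and the command being run are placeholders, and the file name job.sh is arbitrary. Required options such as the partition differ between clusters, so the cluster-specific documentation remains the reference.

```shell
#!/bin/bash
# Write a minimal example job script (all values below are placeholders).
cat > job.sh <<'EOF'
#!/bin/bash -l
#SBATCH --job-name=example
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00

# Commands below run on the allocated compute node, not on the frontend.
echo "running on $(hostname)"
EOF

# Check the script for syntax errors without executing it.
bash -n job.sh && echo "job.sh looks syntactically fine"

# On a cluster frontend, the script would then be submitted and monitored with:
#   sbatch job.sh
#   squeue -u "$USER"
```

The lines starting with #SBATCH are comments for the shell but are read by the batch system as resource requests.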
Why is my code not running on the GPU in TinyGPU?
Do not try to build a GPU-enabled TensorFlow, PyTorch, … on the Woody login nodes. That will fail, as the Woody login nodes do not have NVIDIA software installed (moreover, most TinyGPU nodes run a newer OS version). Please have a look at our documentation about “Working with NVIDIA GPUs”.
How can I access the cluster frontends?
Almost all HPC systems at NHR@FAU use private IP addresses that can only be accessed directly from within the FAU. There are three options for logging in to the clusters from outside the university:

  • Use a VPN (Virtual Private Network) connection.
  • Use IPv6. The cluster frontends have world-visible IPv6 addresses.
  • Use our “dialog server” cshpc.rrze.fau.de. The dialog server is the only HPC machine with a public IPv4 address. cshpc is a Linux system that is open to all HPC accounts. From this machine, you can log into any NHR@FAU system. A more complete description can be found on our documentation pages.

Whichever option you choose, you need to use SSH. Please consult our extensive SSH documentation pages for details.
How can I get access to HPC systems?
Getting an HPC account

Depending on your status, there are different protocols for getting an HPC account:

  • NHR users from outside FAU: See the page on NHR application rules for up-to-date information on allocating resources of NHR@FAU. Also check the pages on the NHR@FAU HPC-Portal Usage / New digital workflow for HPC accounts.
  • FAU staff and students (except for lectures): Use the HPC application form. Details on how to fill in the form are given below. Basic usage of the HPC systems is typically free of charge for FAU researchers for publicly funded research. For compute needs beyond the free basic usage, see the page on NHR application rules for preliminary information on allocating resources of NHR@FAU.
  • Lectures of FAU with need for HPC access: There is a simplified protocol to get HPC accounts for all students of your course. Lecturers have to approach HPC support with a list of the IdM accounts of all students of the course and the course name. Late registrations of additional students are not possible. Thus, be sure to collect all IdM accounts before sending the list to RRZE.
  • Block courses with external participants: Lecturers have to approach HPC support at least one week in advance with the title and date of the course and the expected number of participants. Such accounts cannot be valid for more than one week.

The HPC application form for FAU within HPC4FAU

You can get the application form here: HPC application form. Applications always have to be approved by your local chair / institute – we do not serve private persons. If you have any questions regarding the application, please contact your local IT contact person at the chair / institute. You need to fill out the application form, print it, sign it, and have it stamped with the chair or institute seal. Once it is ready, you can bring it to the RRZE Service Desk or send it via email or internal mail.

Please visit our documentation about Getting started with HPC for more information.
How can I leverage node-local storage on TinyGPU to increase job performance?
Each node has at least 880 GB of local SSD capacity for temporary files under $TMPDIR. The directory $TMPDIR will be deleted automatically as soon as the user has no jobs running on the node any more. Data to be kept can be copied to a cluster-wide volume at the end of the job. Please also read our documentation on “File Systems“.
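A typical pattern, sketched under assumptions: copy the input to $TMPDIR at the start of the job, let the application read and write there, and copy the results back to a cluster-wide volume before the job ends. The directory names and the stand-in "computation" are hypothetical. The fallbacks for $TMPDIR and $WORK are only there so that the snippet also runs outside a job.

```shell
#!/bin/bash
set -euo pipefail

# Fallbacks so the sketch also runs outside a job; on the cluster both are provided.
LOCAL="${TMPDIR:-/tmp}/stage_demo_$$"
PERSIST="${WORK:-$(mktemp -d)}/stage_demo_results_$$"
mkdir -p "$LOCAL/input" "$LOCAL/output" "$PERSIST"

# Stage in: place the input on node-local storage
# (the demo input is created in place here instead of being copied).
echo "42" > "$LOCAL/input/data.txt"

# Run the application against the local copy (stand-in: strip the newline with tr).
tr -d '\n' < "$LOCAL/input/data.txt" > "$LOCAL/output/result.txt"

# Stage out: save the results to the cluster-wide volume before $TMPDIR is deleted.
cp -r "$LOCAL/output/." "$PERSIST/"
echo "results saved to $PERSIST"
```

The stage-out step has to run inside the job itself, since the contents of $TMPDIR are gone once the job has finished.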
Where can I store my data?
Your home directory is accessible via $HOME. Each user gets a standard quota of 50 Gigabytes and quota extensions are not possible.

Additional storage is accessible via $HPCVAULT. Here, the default quota for each user is 500 Gigabytes.

The recommended work directory is accessible via $WORK. The standard quota for each user is 500 Gigabytes.

All three directories ($HOME, $HPCVAULT and $WORK) are available throughout our HPC systems. We recommend that you use the aforementioned variables in your job scripts and do not rely on the specific paths, as these may change over time, e.g. when directories are relocated to a different NFS server.

Job-specific storage (located either in main memory [RAM disk] or, if available, on a local HDD / SSD) is accessible via $TMPDIR and is always node-local. Its size differs between clusters and it is only available during the lifetime of the job. Data is flushed after the job finishes!

Some of our clusters have a local parallel file system for high-performance short-term storage that is accessible via $FASTTMP. These file systems are specific to the clusters and not available on other clusters. This type of storage is not suitable for programs such as MD simulations that have quite high output rates!

Please also have a look at our documentation on “File Systems”.
Why the need for several file systems?
Different file systems have different features; for example, a central NFS server offers a lot of capacity for the money but limited data bandwidth, while a parallel file system is much faster but smaller and usually available to one cluster only. A node-local SSD, on the other hand, has the advantage of very low latency, but it cannot be accessed from outside a compute node. For further information, please consult our documentation on “File systems”.
SSH is asking for a password
If you got your HPC account through the new HPC portal (https://portal.hpc.fau.de), for example because it belongs to an NHR project, there is NO password for this HPC account. You log into the HPC portal using the SSO credentials of your university. Access to the HPC systems with an HPC account created through the portal is by SSH keys only. The SSH public key is uploaded to the HPC portal, and it will take a couple of hours until all HPC systems know a new or changed SSH public key. Multiple SSH public keys can be uploaded.
What is the password of my HPC account?
A) If you got your HPC account through the new HPC portal (https://portal.hpc.fau.de), for example because it belongs to an NHR project, there is NO password for this HPC account. You log into the HPC portal using the SSO credentials of your university. Access to the HPC systems with an HPC account created through the portal is by SSH keys only. The SSH public key is uploaded to the HPC portal, and it will take a couple of hours until all HPC systems know a new or changed SSH public key. Multiple SSH public keys can be uploaded.

B) If you have an FAU IdM account and applied for your HPC account with the HPC paper form, you can set a dedicated password for the HPC account through the IdM portal (https://idm.fau.de). It will take a couple of hours until all HPC systems know the changed password.
Debugging SSH problems
To get more information on SSH problems, add the “-v” option to SSH. This will give moderate debug information, e.g. show which SSH keys are tried. Here is a sample output:

max@notebook:~$ ssh -v cshpc
OpenSSH_7.6p1 Ubuntu-4ubuntu0.7, OpenSSL 1.0.2n 7 Dec 2017
debug1: Reading configuration data /home/max/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: Connecting to cshpc.rrze.uni-erlangen.de [131.188.3.39] port 22.
debug1: Connection established.
debug1: identity file /home/max/.ssh/id_rsa type 0
debug1: key_load_public: No such file or directory
debug1: identity file /home/max/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: Local version string SSH-2.0-OpenSSH_7.6p1 Ubuntu-4ubuntu0.7
debug1: Remote protocol version 2.0, remote software version OpenSSH_8.2p1 Ubuntu-4ubuntu0.5
debug1: match: OpenSSH_8.2p1 Ubuntu-4ubuntu0.5 pat OpenSSH* compat 0x04000000
debug1: Authenticating to cshpc.rrze.uni-erlangen.de:22 as 'unrz143'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ecdsa-sha2-nistp256 SHA256:wFaDywle3yJvygQ4ZAPDsi/iSBTaF6Uoo0i0z727aJU
debug1: Host 'cshpc.rrze.uni-erlangen.de' is known and matches the ECDSA host key.
debug1: Found key in /home/max/.ssh/known_hosts:215
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<ssh-ed25519,sk-ssh-ed25519@openssh.com,ssh-rsa,rsa-sha2-256,rsa-sha2-512,ssh-dss,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,sk-ecdsa-sha2-nistp256@openssh.com>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,password
debug1: Next authentication method: publickey
debug1: Offering public key: RSA SHA256:mWO4eYar1/JYn8MDB0DPer+ibB/QatmhxvvngfaoMgQ /home/max/.ssh/id_rsa
debug1: Server accepts key: pkalg rsa-sha2-512 blen 277
debug1: Authentication succeeded (publickey).
Authenticated to cshpc.rrze.uni-erlangen.de ([131.188.3.39]:22).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: pledge: network
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
debug1: Sending environment.
debug1: Sending env LC_ALL = en_US.UTF-8
debug1: Sending env LANG = en_US.UTF-8

To check the fingerprint of your SSH key, use

max@notebook:~$ ssh-keygen -l -f ~/.ssh/id_rsa
2048 SHA256:mWO4eYar1/JYn8MDB0DPer+ibB/QatmhxvvngfaoMgQ max@notebook (RSA)

This fingerprint must also match the data shown in the HPC portal (if your SSH keys are managed by the HPC portal).
In the debug output I find the following

debug1: Authentications that can continue: publickey,password
debug1: Next authentication method: publickey
debug1: Offering public key: /home/max/.ssh/id_rsa RSA SHA256:xCyJUQcsJldPWfZXSasoI0ZCoteKWHw1e95ylm2HK1g agent
debug1: Server accepts key: /home/max/.ssh/id_rsa RSA SHA256:xCyJUQcsJldPWfZXSasoI0ZCoteKWHw1e95ylm2HK1g agent
sign_and_send_pubkey: signing failed for RSA "/home/max/.ssh/id_rsa" from agent: agent refused operation
debug1: Next authentication method: password

This message is misleading. “sign_and_send_pubkey: signing failed … agent refused operation” typically means that you entered a wrong passphrase for the SSH key.
How can I attach to a running Slurm job?
To attach to a running Slurm job, use srun --pty --jobid YOUR-JOBID bash. This will give you a shell on the first node of your job, where you can run top, nvidia-smi, etc. to check your job. This is an alternative to SSH-ing into your node. If you have multiple GPU jobs running on a single node, attaching with srun is the only way to see the correct GPU: SSH will always put you into the most recently modified cgroup, which might not be the job / GPUs you are looking for.
I managed to log in to cshpc (with an SSH key) but get asked for a password when continuing to a cluster frontend
The explanation is rather simple: the dialog server cshpc does not know any SSH (private) key of yours. It therefore cannot do SSH key-based authentication when connecting to one of the cluster frontends and falls back to password authentication. There are a couple of ways to solve this:

  • Use the “jump host”/“proxy jump” feature of SSH and connect directly to the cluster frontends through the dialog server cshpc. To do this, either use the command line option “-J” of recent SSH versions or use an ~/.ssh/config file on your local computer. See https://hpc.fau.de/systems-services/documentation-instructions/ssh-secure-shell-access-to-hpc-systems/#ssh_config_hpc_portal for templates.
  • Create an additional SSH key pair on cshpc and add the corresponding SSH public key to the HPC portal (if your account is already managed through the HPC portal), or add it to ~/.ssh/authorized_keys (which will only be a temporary solution until all HPC accounts are migrated to the HPC portal).
  • Use an SSH agent on your local computer and allow it to forward its connection to our dialog server cshpc.

All three ways make sure that an SSH private key is available for authentication when connecting to the cluster frontends.
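As an illustration of the jump host approach, the following minimal sketch writes an example SSH client configuration to a scratch file. The frontend alias myfrontend, the host name frontend.example.org, the account name YOUR-HPC-ACCOUNT and the key path are placeholders chosen for this example, not actual values; the official templates linked above remain the authoritative reference.

```shell
# Write an example SSH client configuration to a scratch file.
# Merge the entries into ~/.ssh/config on your local computer once adapted.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
# Dialog server, reachable directly
Host cshpc
    HostName cshpc.rrze.uni-erlangen.de
    User YOUR-HPC-ACCOUNT
    IdentityFile ~/.ssh/id_rsa

# Placeholder cluster frontend, reached through cshpc
Host myfrontend
    HostName frontend.example.org
    User YOUR-HPC-ACCOUNT
    IdentityFile ~/.ssh/id_rsa
    ProxyJump cshpc
EOF

echo "Example configuration written to $cfg"
# After merging into ~/.ssh/config, connect with:   ssh myfrontend
# One-off alternative without a config file:
#   ssh -J YOUR-HPC-ACCOUNT@cshpc.rrze.uni-erlangen.de YOUR-HPC-ACCOUNT@frontend.example.org
```

With this setup the private key never leaves your local computer; cshpc only relays the connection.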
My HPC account has just been created but I cannot log in or Slurm rejects my jobs
Home directories and entries in the Slurm database are only created once per day (in the late morning). Thus, be patient and wait until 10 o’clock the next day if your HPC account has just been (re)created. As with SSH key updates (and password updates for legacy accounts), the processes run on different servers at different times. Thus, before the next day, some of your directories or services may already have been created while others have not.
My just updated SSH key (from the HPC portal) or password (from the IdM portal) is not accepted
It always takes a couple of hours for updated SSH keys to be propagated to all HPC systems. As the clusters are synchronized at different points in time, one system may already know the update while others do not. It typically takes 2-4 hours for an update to reach all systems. The same is true for the propagation of HPC passwords for accounts created through the FAU IdM portal using paper applications.
SSH is asking for a password
If you got your HPC account through the new HPC portal (https://portal.hpc.fau.de) because it is for example an NHR project, there is NO password for such an HPC account. You log into the HPC portal using your SSO credentials (of your university). Access to the HPC systems with your HPC account created through the portal is by SSH keys only. The SSH public key is uploaded to the HPC portal and it will take a couple of hours until all HPC systems know a new/changed SSH public key. Multiple SSH public keys can be uploaded.
What is the password of my HPC account?
A) If you got your HPC account through the new HPC portal (https://portal.hpc.fau.de) because it is for example an NHR project, there is NO password for such an HPC account. You log into the HPC portal using your SSO credentials (of your university). Access to the HPC systems with your HPC account created through the portal is by SSH keys only. The SSH public key is uploaded to the HPC portal and it will take a couple of hours until all HPC systems know a new/changed SSH public key. Multiple SSH public keys can be uploaded. B) If you have an FAU IdM account and applied for your HPC account with the HPC paper form, you can set a dedicated password for the HPC account through the IdM portal (https://idm.fau.de). It will take a couple of hours until all HPC systems know the changed password.
Slurm options get ignored when given as sbatch command line arguments
I give some Slurm options as command line arguments to sbatch, but they are ignored!? The syntax of sbatch is:

sbatch [OPTIONS(0)...] [ : [OPTIONS(N)...]] script(0) [args(0)...]

Thus, options for sbatch have to be given before the batch script. Arguments given after the batch script are passed to the batch script as its arguments and not to sbatch.
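The effect of the argument order can be tried without submitting anything. The following minimal sketch creates a stand-in job script (job.sh is a hypothetical name) that only prints the arguments it receives; the sbatch calls are shown as comments, and running the script directly shows what it would see if the option is placed after the script name.

```shell
# Create a tiny stand-in job script that only reports the arguments it receives.
workdir=$(mktemp -d)
cat > "$workdir/job.sh" <<'EOF'
#!/bin/bash -l
echo "script arguments: $*"
EOF

# Correct: the option precedes the script, so sbatch itself evaluates it:
#   sbatch --time=01:00:00 job.sh
# Ignored by sbatch: the option follows the script and is handed to the script:
#   sbatch job.sh --time=01:00:00

# Running the script directly shows what it sees in the second case.
seen=$(bash "$workdir/job.sh" --time=01:00:00)
echo "$seen"
```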
Error “module: command not found”
If the module command cannot be found, that usually means that you did not invoke the bash shell with the option “-l” (lower case L). Thus, job scripts, etc. should always start with

#!/bin/bash -l
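A minimal job script sketch along these lines is shown below. The job name, the time limit and the reporting logic are assumptions made for illustration; the script only checks whether the module command is defined and reports the result.

```shell
#!/bin/bash -l
#SBATCH --job-name=module-check
#SBATCH --time=00:05:00

# With a login shell ("-l"), the shell initialization files that typically
# define the module command are read before the script body runs.
if type module >/dev/null 2>&1; then
    module_status="available"
    module list
else
    module_status="missing"
fi
echo "module command: $module_status"
```

If the script reports "missing" inside a job, check that the first line really contains the “-l” option.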
How can I run my job on the cluster?
To submit a job to one of our clusters, you first have to log in to a cluster frontend. The compute nodes are not directly accessible; a batch system handles the queuing of jobs into different partitions (depending on the needed resources, e.g. runtime) and sorts them according to a priority scheme. A job will run when the required resources become available. Please do not run your jobs on the cluster frontends! The login nodes are not suitable for computational work, since they are shared among all users. We do not allow MPI-parallel applications on the frontends; short parallel test runs must be performed using batch jobs. It is possible to submit interactive batch jobs that, when started, open a shell on one of the assigned compute nodes and let you run interactive programs there. Please consult our documentation for details about Batch Processing. We provide general job script examples for parallel jobs and GPU jobs as well as more specific job scripts for applications that our users frequently run on the clusters.
The software I need is not installed. What can I do now?
On all HPC systems, established tools for software development (compilers, editors, …), libraries, and selected applications are available. For many of these applications, it is necessary to set special environment variables, so that e.g. search paths are correct or license servers can be found.

To ease selection of and switching between different versions of software packages, all HPC systems use the modules system (cf. modules.sourceforge.net). It allows you to conveniently load the necessary configurations for different programs or different versions of the same program and, if necessary, unload them again later.

For information on how to use the modules system, please have a look at the respective section in our documentation about “Software environment”.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack” as an enhanced HPC package manager.

Feel free to compile software yourself in the versions and with the options you need. This is perfectly fine, yet support for self-installed software cannot be granted. We can only provide software centrally if it is of importance for multiple groups. If you want to use Spack for compiling additional software, you can load our user-spack module; it lets you reuse the packages we have already built with Spack if the concretization matches, instead of starting from scratch. Once user-spack is loaded, the command spack will be available (as an alias) and you will inherit the pre-sets we defined for certain packages (e.g. Open MPI to work with Slurm), but you will install everything into your own directories ($WORK/USER-SPACK).

You can also bring your own environment in a container using Singularity. However, building Singularity containers on the HPC systems themselves is not supported (as that would require root access). The Infiniband drivers from the host are not mounted into your container. All filesystems will also be available by default in the container. In certain use cases it might be a good idea to avoid bind-mounting your normal $HOME directory with all its “dot directories” into the container by explicitly specifying a different directory, e.g. -H $HOME/my-container-home.
Why does my program give a http/https timeout?
When running software that tries to connect to the internet on one of the cluster nodes, you might encounter time-out errors. By default, we do not allow cluster nodes to access the internet. However, you can work around this by setting a proxy:

export http_proxy=http://proxy:80
export https_proxy=http://proxy:80
Why is my code not running on the GPU in TinyGPU?
Do not try to build a GPU-enabled TensorFlow, PyTorch, … on the Woody login nodes. That will fail as the Woody login nodes do not have Nvidia software installed (moreover, most TinyGPU nodes run a newer OS version). Please have a look at our documentation about “Working with NVIDIA GPUs”.
Why is my code not using the GPU on Alex?
CUDA is not installed as part of the OS – you have to load a cuda module for your binaries to find libcublas, libcudnn, etc. Please also have a look at our documentation “Working with NVIDIA GPUs“.
How can I leverage node-local storage on TinyGPU to increase job performance?
Each node has at least 880 GB of local SSD capacity for temporary files under $TMPDIR. The directory $TMPDIR will be deleted automatically as soon as the user has no jobs running on the node any more. Data to be kept can be copied to a cluster-wide volume at the end of the job. Please also read our documentation on “File Systems“.
How can I request an interactive job on TinyGPU?
Interactive Slurm Shell (RTX2080Ti, RTX3080, V100 and A100 nodes only)

To generate an interactive Slurm shell on one of the compute nodes, issue the following command on the woody frontend:

salloc.tinygpu --gres=gpu:1 --time=00:30:00

This will give you an interactive shell for 30 minutes on one of the nodes, allocating 1 GPU and the respective number of CPU cores. There, you can then for example compile your code or do test runs of your binary. For MPI-parallel binaries, use srun instead of mpirun.

Please note that salloc automatically exports the environment of your shell on the login node to your interactive job. This can cause problems if you have loaded any modules, due to the version differences between the woody frontend and the TinyGPU compute nodes. To mitigate this, purge all loaded modules via module purge before issuing the salloc command.

This and more information can be found in our documentation about TinyGPU.
I have to analyze over 2 million files in my job. What can I do?
Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available. If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar + optional compression) and use node-local storage that is accessible via $TMPDIR.
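The archive approach can be sketched as follows. This is only an illustration under stated assumptions: the directory and file names are made up, 100 small files stand in for the real data set, and the fallback to /tmp exists only so that the sketch can be tried outside of a job, where $TMPDIR may not be set.

```shell
# Stand-in for a directory with very many small input files (here: 100).
SRC=$(mktemp -d)
i=1
while [ "$i" -le 100 ]; do
    echo "sample $i" > "$SRC/file_$i.dat"
    i=$((i + 1))
done

# One-time step: pack everything into a single archive.
ARCHIVE="$SRC.tar.gz"
tar -czf "$ARCHIVE" -C "$SRC" .

# Inside the job: unpack to node-local storage and work on the local copy.
LOCAL="${TMPDIR:-/tmp}/input.$$"
mkdir -p "$LOCAL"
tar -xzf "$ARCHIVE" -C "$LOCAL"

n_files=$(find "$LOCAL" -type f | wc -l | tr -d ' ')
echo "unpacked files: $n_files"
```

This way the shared file system only has to serve one large file per job instead of millions of small ones; results that must be kept are copied back at the end of the job.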
How can I request an interactive job on Woody-NG?
Interactive jobs can be requested by using salloc and specifying the respective options on the command line. The following will give you an interactive shell on one node with one core dedicated to you for one hour:

salloc -n 1 --time=01:00:00

Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job! This and more information can be found in our documentation about Woody-NG.

Software

Alex Cluster

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar + optional compression) and use node-local storage that is accessible via $TMPDIR.

The software I need is not installed. What can I do now?

On all HPC systems, established tools for software development (compilers, editors, …), libraries, and selected applications are available. For many of these applications, it is necessary to set special environment variables, so that e.g. search paths are correct or license servers can be found.

To ease selection of and switching between different versions of software packages, all HPC systems use the modules system (cf. modules.sourceforge.net). It allows you to conveniently load the necessary configurations for different programs or different versions of the same program and, if necessary, unload them again later.

For information on how to use the modules system, please have a look at the respective section in our documentation about “Software environment”.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack” as an enhanced HPC package manager.

Feel free to compile software yourself in the versions and with the options you need. This is perfectly fine, yet support for self-installed software cannot be granted. We can only provide software centrally if it is of importance for multiple groups. If you want to use Spack for compiling additional software, you can load our user-spack module; it lets you reuse the packages we have already built with Spack if the concretization matches, instead of starting from scratch. Once user-spack is loaded, the command spack will be available (as an alias) and you will inherit the pre-sets we defined for certain packages (e.g. Open MPI to work with Slurm), but you will install everything into your own directories ($WORK/USER-SPACK).

You can also bring your own environment in a container using Singularity. However, building Singularity containers on the HPC systems themselves is not supported (as that would require root access). The Infiniband drivers from the host are not mounted into your container. All filesystems will also be available by default in the container. In certain use cases it might be a good idea to avoid bind-mounting your normal $HOME directory with all its “dot directories” into the container by explicitly specifying a different directory, e.g. -H $HOME/my-container-home.

Why is my code not using the GPU on Alex?

CUDA is not installed as part of the OS – you have to load a cuda module for your binaries to find libcublas, libcudnn, etc.

Please also have a look at our documentation “Working with NVIDIA GPUs“.

Basic HPC Knowledge

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar + optional compression) and use node-local storage that is accessible via $TMPDIR.

Batch System

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar + optional compression) and use node-local storage that is accessible via $TMPDIR.

CUDA

Why is my code not running on the GPU in TinyGPU?

Do not try to build a GPU-enabled TensorFlow, PyTorch, … on the Woody login nodes. That will fail as the Woody login nodes do not have Nvidia software installed (moreover, most TinyGPU nodes run a newer OS version).

Please have a look at our documentation about “Working with NVIDIA GPUs”.

Why is my code not using the GPU on Alex?

CUDA is not installed as part of the OS – you have to load a cuda module for your binaries to find libcublas, libcudnn, etc.

Please also have a look at our documentation “Working with NVIDIA GPUs“.

Cluster Access

The software I need is not installed. What can I do now?

On all HPC systems, established tools for software development (compilers, editors, …), libraries, and selected applications are available. For many of these applications, it is necessary to set special environment variables, so that e.g. search paths are correct or license servers can be found.

To ease selection of and switching between different versions of software packages, all HPC systems use the modules system (cf. modules.sourceforge.net). It allows you to conveniently load the necessary configurations for different programs or different versions of the same program and, if necessary, unload them again later.

For information on how to use the modules system, please have a look at the respective section in our documentation about “Software environment”.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack” as an enhanced HPC package manager.

Feel free to compile software yourself in the versions and with the options you need. This is perfectly fine, yet support for self-installed software cannot be granted. We can only provide software centrally if it is of importance for multiple groups. If you want to use Spack for compiling additional software, you can load our user-spack module; it lets you reuse the packages we have already built with Spack if the concretization matches, instead of starting from scratch. Once user-spack is loaded, the command spack will be available (as an alias) and you will inherit the pre-sets we defined for certain packages (e.g. Open MPI to work with Slurm), but you will install everything into your own directories ($WORK/USER-SPACK).

You can also bring your own environment in a container using Singularity. However, building Singularity containers on the HPC systems themselves is not supported (as that would require root access). The Infiniband drivers from the host are not mounted into your container. All filesystems will also be available by default in the container. In certain use cases it might be a good idea to avoid bind-mounting your normal $HOME directory with all its “dot directories” into the container by explicitly specifying a different directory, e.g. -H $HOME/my-container-home.

Continuous X

What is Continuous Benchmarking (CB)?

CB can be seen as a variant of Continuous Testing (CT) in which not only functionality but also performance is tested, in order to avoid regressions, i.e., unwanted performance degradation due to code changes.

Please also see our documentation on “Continuous Integration / Gitlab Cx“.
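The core idea of such a performance check can be sketched in a few lines of shell. The function name check_regression, the baseline of 5 seconds and the tolerance of 20 % are assumptions chosen for illustration, and sleep 1 merely stands in for a real benchmark run; an actual CB setup would usually store baselines per system and use finer timing.

```shell
# check_regression ELAPSED BASELINE TOLERANCE_PERCENT
# Succeeds if ELAPSED is at most BASELINE plus the given tolerance (integers).
check_regression() {
    elapsed=$1
    baseline=$2
    tolerance=$3
    limit=$(( baseline + baseline * tolerance / 100 ))
    if [ "$elapsed" -le "$limit" ]; then
        echo "OK: ${elapsed}s <= ${limit}s"
        return 0
    else
        echo "REGRESSION: ${elapsed}s > ${limit}s"
        return 1
    fi
}

# Time a stand-in benchmark command with one-second resolution.
start=$(date +%s)
sleep 1            # replace with the actual benchmark run
end=$(date +%s)

# Hypothetical baseline of 5 s with 20 % tolerance.
check_regression "$(( end - start ))" 5 20
```

A non-zero return value can be used to let a pipeline stage fail whenever the measured runtime exceeds the allowed limit.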

What is Continuous Deployment (CD)?

CD is the automatic deployment of the software coming out of the other Cx processes. This can be the installation on a particular system, rolling out a revision within a whole organization, pushing installation packages to public repositories, etc.

Please also see our documentation on “Continuous Integration / Gitlab Cx“.

What is Continuous Integration (CI)?

Continuous Integration is the practice of automatically integrating code changes into a software project. It relies on a code repository that supports automated building and testing. Often, CI also involves setting up a build system from scratch, including all dependencies.

Please also see our documentation on “Continuous Integration / Gitlab Cx“.

What is Continuous Testing (CT)?

It is the practice of executing automated tests as an integral part of the software development process. It tries to make sure that no functionality is lost and no errors are introduced during development.

Please also see our documentation on “Continuous Integration / Gitlab Cx“.

FileSystems/Data Storage

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar + optional compression) and use node-local storage that is accessible via $TMPDIR.

Fritz Cluster

The software I need is not installed. What can I do now?

On all HPC systems, established tools for software development (compilers, editors, …), libraries, and selected applications are available. For many of these applications, it is necessary to set special environment variables, so that e.g. search paths are correct or license servers can be found.

To ease selection of and switching between different versions of software packages, all HPC systems use the modules system (cf. modules.sourceforge.net). It allows you to conveniently load the necessary configurations for different programs or different versions of the same program and, if necessary, unload them again later.

For information on how to use the modules system, please have a look at the respective section in our documentation about “Software environment”.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack” as an enhanced HPC package manager.

Feel free to compile software yourself in the versions and with the options you need. This is perfectly fine, yet support for self-installed software cannot be granted. We can only provide software centrally if it is of importance for multiple groups. If you want to use Spack for compiling additional software, you can load our user-spack module; it lets you reuse the packages we have already built with Spack if the concretization matches, instead of starting from scratch. Once user-spack is loaded, the command spack will be available (as an alias) and you will inherit the pre-sets we defined for certain packages (e.g. Open MPI to work with Slurm), but you will install everything into your own directories ($WORK/USER-SPACK).

You can also bring your own environment in a container using Singularity. However, building Singularity containers on the HPC systems themselves is not supported (as that would require root access). The Infiniband drivers from the host are not mounted into your container. All filesystems will also be available by default in the container. In certain use cases it might be a good idea to avoid bind-mounting your normal $HOME directory with all its “dot directories” into the container by explicitly specifying a different directory, e.g. -H $HOME/my-container-home.

GPU usage

Why is my code not running on the GPU in TinyGPU?

Do not try to build a GPU-enabled TensorFlow, PyTorch, … on the Woody login nodes. That will fail as the Woody login nodes do not have Nvidia software installed (moreover, most TinyGPU nodes run a newer OS version).

Please have a look at our documentation about “Working with NVIDIA GPUs”.

Why is my code not using the GPU on Alex?

CUDA is not installed as part of the OS – you have to load a cuda module for your binaries to find libcublas, libcudnn, etc.

Please also have a look at our documentation “Working with NVIDIA GPUs“.

Software environment

The software I need is not installed. What can I do now?

On all HPC systems, established tools for software development (compilers, editors, …), libraries, and selected applications are available. For many of these applications, it is necessary to set special environment variables, so that e.g. search paths are correct or license servers can be found.

To ease selection of and switching between different versions of software packages, all HPC systems use the modules system (cf. modules.sourceforge.net). It allows you to conveniently load the necessary configurations for different programs or different versions of the same program and, if necessary, unload them again later.

For information on how to use the modules system, please have a look at the respective section in our documentation about “Software environment”.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack” as an enhanced HPC package manager.

Feel free to compile software yourself in the versions and with the options you need. This is perfectly fine, yet support for self-installed software cannot be granted. We can only provide software centrally if it is of importance to multiple groups. If you want to use Spack for compiling additional software, you can load our user-spack module; it lets you reuse the packages we have already built with Spack whenever the concretization matches, instead of starting from scratch. Once user-spack is loaded, the command spack is available (as an alias) and you inherit the presets we defined for certain packages (e.g. Open MPI to work with Slurm), but you install everything into your own directories ($WORK/USER-SPACK).

You can also bring your own environment in a container using Singularity. However, building Singularity containers on the HPC systems themselves is not supported (as that would require root access). The Infiniband drivers from the host are not mounted into your container. All filesystems will also be available by default in the container. In certain use cases it might be a good idea to avoid bind-mounting your normal $HOME directory with all its “dot directories” into the container by explicitly specifying a different directory, e.g. -H $HOME/my-container-home.

Why is my code not running on the GPU in TinyGPU?

Do not try to build a GPU-enabled TensorFlow, PyTorch, … on the Woody login nodes. That will fail as the Woody login nodes do not have NVIDIA software installed (moreover, most TinyGPU nodes run a newer OS version).

Please have a look at our documentation about “Working with NVIDIA GPUs”.

Why is my code not using the GPU on Alex?

CUDA is not installed as part of the OS – you have to load a cuda module for your binaries to find libcublas, libcudnn, etc.

Please also have a look at our documentation on “Working with NVIDIA GPUs”.

TinyGPU Cluster

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar + optional compression) and use node-local storage that is accessible via $TMPDIR.
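The archive approach could look like the following minimal sketch. The helper names pack_dataset and unpack_to_scratch and the directory name dataset.$$ are made up for illustration, and the fallback to /tmp is only there so that the functions can be tried outside of a job.

```shell
# pack_dataset SRC_DIR ARCHIVE: bundle a directory of small files into one tar archive.
pack_dataset() {
    local src_dir="$1" archive="$2"
    # -C changes into the source directory so the archive holds relative paths.
    tar -cf "$archive" -C "$src_dir" .
}

# unpack_to_scratch ARCHIVE: extract the archive to node-local storage
# and print the directory that now holds the files.
unpack_to_scratch() {
    local archive="$1"
    # $TMPDIR is expected to be set inside a job; /tmp is an assumed fallback for testing.
    local scratch="${TMPDIR:-/tmp}/dataset.$$"
    mkdir -p "$scratch"
    tar -xf "$archive" -C "$scratch"
    echo "$scratch"
}
```

The idea is to pack once, e.g. pack_dataset $WORK/many-small-files $WORK/dataset.tar, and to call data_dir=$(unpack_to_scratch $WORK/dataset.tar) at the beginning of each job. The shared file system then only sees one large file, and the individual files are opened on the node-local disk.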

The software I need is not installed. What can I do now?

On all HPC systems, established tools for software development (compilers, editors, …), libraries, and selected applications are available. For many of these applications, it is necessary to set special environment variables, so that e.g. search paths are correct or license servers can be found.

To ease selection of and switching between different versions of software packages, all HPC systems use the modules system (cf. modules.sourceforge.net). It allows you to conveniently load the necessary configurations for different programs or different versions of the same program and, if necessary, unload them again later.

For information on how to use the modules system, please have a look at the respective section in our documentation about “Software environment”.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack” as an enhanced HPC package manager.

Feel free to compile software yourself in the versions and with the options you need. This is perfectly fine, yet support for self-installed software cannot be granted. We can only provide software centrally if it is of importance to multiple groups. If you want to use Spack for compiling additional software, you can load our user-spack module; it lets you reuse the packages we have already built with Spack whenever the concretization matches, instead of starting from scratch. Once user-spack is loaded, the command spack is available (as an alias) and you inherit the presets we defined for certain packages (e.g. Open MPI to work with Slurm), but you install everything into your own directories ($WORK/USER-SPACK).

You can also bring your own environment in a container using Singularity. However, building Singularity containers on the HPC systems themselves is not supported (as that would require root access). The Infiniband drivers from the host are not mounted into your container. All filesystems will also be available by default in the container. In certain use cases it might be a good idea to avoid bind-mounting your normal $HOME directory with all its “dot directories” into the container by explicitly specifying a different directory, e.g. -H $HOME/my-container-home.

Why is my code not running on the GPU in TinyGPU?

Do not try to build a GPU-enabled TensorFlow, PyTorch, … on the Woody login nodes. That will fail as the Woody login nodes do not have NVIDIA software installed (moreover, most TinyGPU nodes run a newer OS version).

Please have a look at our documentation about “Working with NVIDIA GPUs”.

Woody Cluster

The software I need is not installed. What can I do now?

On all HPC systems, established tools for software development (compilers, editors, …), libraries, and selected applications are available. For many of these applications, it is necessary to set special environment variables, so that e.g. search paths are correct or license servers can be found.

To ease selection of and switching between different versions of software packages, all HPC systems use the modules system (cf. modules.sourceforge.net). It allows you to conveniently load the necessary configurations for different programs or different versions of the same program and, if necessary, unload them again later.

For information on how to use the modules system, please have a look at the respective section in our documentation about “Software environment”.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack” as an enhanced HPC package manager.

Feel free to compile software yourself in the versions and with the options you need. This is perfectly fine, yet support for self-installed software cannot be granted. We can only provide software centrally if it is of importance to multiple groups. If you want to use Spack for compiling additional software, you can load our user-spack module; it lets you reuse the packages we have already built with Spack whenever the concretization matches, instead of starting from scratch. Once user-spack is loaded, the command spack is available (as an alias) and you inherit the presets we defined for certain packages (e.g. Open MPI to work with Slurm), but you install everything into your own directories ($WORK/USER-SPACK).

You can also bring your own environment in a container using Singularity. However, building Singularity containers on the HPC systems themselves is not supported (as that would require root access). The Infiniband drivers from the host are not mounted into your container. All filesystems will also be available by default in the container. In certain use cases it might be a good idea to avoid bind-mounting your normal $HOME directory with all its “dot directories” into the container by explicitly specifying a different directory, e.g. -H $HOME/my-container-home.

What is Continuous Benchmarking (CB)?

CB can be seen as a variant of CT, where not only functionality but also performance is tested in order to avoid regressions, i.e., unwanted performance degradation due to code changes.

Please also see our documentation on “Continuous Integration / Gitlab Cx”.

What is Continuous Deployment (CD)?

CD is the automatic deployment of the software coming out of the other Cx processes. This can be the installation on a particular system, rolling out a revision within a whole organization, pushing installation packages to public repositories, etc.

Please also see our documentation on “Continuous Integration / Gitlab Cx”.

What is Continuous Integration (CI)?

Continuous Integration is the practice of automatically integrating code changes into a software project. It relies on a code repository that supports automated building and testing. Often, CI also involves setting up a build system from scratch, including all dependencies.

Please also see our documentation on “Continuous Integration / Gitlab Cx”.

What is Continuous Testing (CT)?

It is the practice of executing automated tests as an integral part of the software development process. It tries to make sure that no functionality is lost and no errors are introduced during development.

Please also see our documentation on “Continuous Integration / Gitlab Cx”.

Hardware

Alex Cluster

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar + optional compression) and use node-local storage that is accessible via $TMPDIR.

Basic HPC Knowledge

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar + optional compression) and use node-local storage that is accessible via $TMPDIR.

Is it true that Arm processors are now competitive in HPC?

“Arm CPU” just means that it uses an instruction set architecture (ISA) licensed from Arm, but the hardware implementation can vary a lot. There is a plethora of Arm-based designs, some from Arm and many from other vendors. Many target the low-power/embedded market, but there are some which have entered the HPC area. Prominent examples are the Fujitsu A64FX and the Marvell ThunderX2. A TX2 system is available in the NHR test and benchmark cluster.

What is a parallel file system?

In a parallel file system (PFS), data is distributed not only across several disks but also across multiple servers in order to increase the data access bandwidth. Most PFSs are connected to the high-speed network of a cluster, and aggregated bandwidths in the TByte/s range are not uncommon. High bandwidth can, however, only be obtained with large files and streaming access.

For information on how to use a parallel file system on our clusters, please read our documentation on “Parallel file systems $FASTTMP”.

What is a vector computer?

A vector computer has an ISA and CPU architecture that enable efficient operations on array data. This goes under the name of Single Instruction Multiple Data (SIMD). SIMD features have proliferated in commodity CPUs as well, but a true vector CPU has features that make it more efficient, such as large vector lengths (e.g., 256 elements) and a high memory bandwidth (e.g., 1.5 TByte/s). Currently, only NEC offers a true vector processor, the SX-Aurora Tsubasa. A node with two Tsubasa cards is available in the NHR test and benchmark cluster.

Why should I care about file systems?

Not only may efficient file operations speed up your own code (if file I/O is what you must do); they will also reduce the burden on shared file servers and thus leave more performance headroom for other users of the resource. Hence, it is a matter of thoughtfulness to optimize file accesses even if your performance gain is marginal.

Batch System

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar + optional compression) and use node-local storage that is accessible via $TMPDIR.

FileSystems/Data Storage

How can I leverage node-local storage on TinyGPU to increase job performance?

Each node has at least 880 GB of local SSD capacity for temporary files under $TMPDIR.

The directory $TMPDIR will be deleted automatically as soon as the user has no jobs running on the node any more.

Data to be kept can be copied to a cluster-wide volume at the end of the job.
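A job could follow a stage-in, compute, stage-out pattern, sketched below under several assumptions: run_on_local_scratch is a hypothetical helper, the line count merely stands in for your real application, and the /tmp fallback exists only so that the function can be tried outside of a job.

```shell
# run_on_local_scratch INPUT RESULT_DIR: copy the input to node-local storage,
# work there, and copy the result back to a cluster-wide directory.
run_on_local_scratch() {
    local input="$1" result_dir="$2"
    # $TMPDIR is expected to point to the node-local SSD inside a job.
    local scratch="${TMPDIR:-/tmp}/job.$$"
    mkdir -p "$scratch" "$result_dir"

    # Stage in: read the input from the shared file system only once.
    cp "$input" "$scratch/input.dat"

    # Placeholder for the real application: count the lines of the input.
    wc -l < "$scratch/input.dat" > "$scratch/output.dat"

    # Stage out: node-local data is deleted later, so keep the result elsewhere.
    cp "$scratch/output.dat" "$result_dir/"
}
```

A call such as run_on_local_scratch $WORK/input.dat $WORK/results (both paths are examples) keeps all intermediate I/O on the local SSD and touches the cluster-wide volume only at the beginning and the end of the job.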

Please also read our documentation on “File Systems”.

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar + optional compression) and use node-local storage that is accessible via $TMPDIR.

What is a parallel file system?

In a parallel file system (PFS), data is distributed not only across several disks but also multiple servers in order to increase the data access bandwidth. Most PFS’s are connected to the high-speed network of a cluster, and aggregated bandwidths in the TByte/s range are not uncommon. High bandwidth can, however, only be obtained with large files and streaming access.

For information on how to use a parallel file system on our clusters, please read our documentation on “Parallel file systems $FASTTMP”.

What is file system metadata?

Metadata comprises all the bookkeeping information in a file system: file sizes, permissions, modification and access times, etc. A workload that, e.g., opens and closes files in rapid succession leads to frequent metadata accesses, putting a lot of strain on any file server infrastructure. This is why a small number of users with a “toxic” workload can slow down file operations to a crawl for everyone. Note also that parallel file systems in particular are ill-suited for metadata-heavy operations.

Where can I store my data?

Your home directory is accessible via $HOME. Each user gets a standard quota of 50 Gigabytes and quota extensions are not possible.

Additional storage is accessible via $HPCVAULT. Here, the default quota for each user is 500 Gigabytes.

The recommended work directory is accessible via $WORK. The standard quota for each user is 500 Gigabytes.

All three directories ($HOME, $HPCVAULT and $WORK) are available throughout our HPC systems.

We recommend that you use the aforementioned variables in your job scripts and do not rely on the specific paths, because these may change over time, e.g. when directories are relocated to a different NFS server.
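In a job script, this could look like the following sketch. The project subdirectories are hypothetical, and the fallback values are assumptions that are only needed when the snippet is tried outside the clusters, where $WORK and $HPCVAULT are not defined.

```shell
#!/bin/bash
# Sketch only: "myproject" and its subdirectories are hypothetical names.

# Fallbacks for running outside the clusters, where these variables are unset.
: "${WORK:=/tmp/work_demo}"
: "${HPCVAULT:=/tmp/vault_demo}"

# Build all paths from the variables instead of hard-coding server paths.
INPUT_DIR="$WORK/myproject/input"
ARCHIVE_DIR="$HPCVAULT/myproject/archive"

echo "reading input from:  $INPUT_DIR"
echo "archiving results to: $ARCHIVE_DIR"
```

If a directory is later moved to a different server, only the value of the variable changes and the script keeps working.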

Job-specific storage (either located in main memory [RAM disk] or, if available, on a local HDD/SSD) is accessible via $TMPDIR and is always node-local. Its size differs between clusters, and it is only available during the lifetime of the job. Data is flushed after the job finishes!

Some of our clusters have a local parallel filesystem for high performance short-term storage that is accessible via $FASTTMP. These filesystems are specific to the clusters and not available on other clusters. This type of storage is not suitable for programs such as MD simulations that have quite high output rates!

Please also have a look at our documentation on “File Systems”.

Why should I care about file systems?

Not only may efficient file operations speed up your own code (if file I/O is what you must do); they will also reduce the burden on shared file servers and thus leave more performance headroom for other users of the resource. Hence, it is a matter of thoughtfulness to optimize file accesses even if your performance gain is marginal.

Why the need for several file systems?

Different file systems have different features; for example, a central NFS server has massive bytes for the buck but limited data bandwidth, while a parallel file system is much faster but smaller and usually available to one cluster only. A node-local SSD, on the other hand, has the advantage of very low latency, but it cannot be accessed from outside a compute node.

For further information, please consult our documentation on “File systems”.

Miscellaneous

Is it true that Arm processors are now competitive in HPC?

“Arm CPU” just means that it uses an instruction set architecture (ISA) licensed from Arm, but the hardware implementation can vary a lot. There is a plethora of Arm-based designs, some from Arm and many from other vendors. Many target the low-power/embedded market, but there are some which have entered the HPC area. Prominent examples are the Fujitsu A64FX and the Marvell ThunderX2. A TX2 system is available in the NHR test and benchmark cluster.

What is a vector computer?

A vector computer has an ISA and CPU architecture that enable efficient operations on array data. This goes under the name of Single Instruction Multiple Data (SIMD). SIMD features have proliferated in commodity CPUs as well, but a true vector CPU has features that make it more efficient, such as large vector lengths (e.g., 256 elements) and a high memory bandwidth (e.g., 1.5 Tbyte/s). Currently, only NEC offers a true vector processor, the SX-Aurora Tsubasa. A node with two Tsubasa cards is available in the NHR test and benchmark cluster.

Test Cluster

Is it true that Arm processors are now competitive in HPC?

“Arm CPU” just means that it uses an instruction set architecture (ISA) licensed from Arm, but the hardware implementation can vary a lot. There is a plethora of Arm-based designs, some from Arm and many from other vendors. Many target the low-power/embedded market, but there are some which have entered the HPC area. Prominent examples are the Fujitsu A64FX and the Marvell ThunderX2. A TX2 system is available in the NHR test and benchmark cluster.

What is a vector computer?

A vector computer has an ISA and CPU architecture that enable efficient operations on array data. This goes under the name of Single Instruction Multiple Data (SIMD). SIMD features have proliferated in commodity CPUs as well, but a true vector CPU has features that make it more efficient, such as large vector lengths (e.g., 256 elements) and a high memory bandwidth (e.g., 1.5 Tbyte/s). Currently, only NEC offers a true vector processor, the SX-Aurora Tsubasa. A node with two Tsubasa cards is available in the NHR test and benchmark cluster.

TinyGPU Cluster

How can I leverage node-local storage on TinyGPU to increase job performance?

Each node has at least 880 GB of local SSD capacity for temporary files under $TMPDIR.

The directory $TMPDIR will be deleted automatically as soon as the user has no jobs running on the node any more.

Data to be kept can be copied to a cluster-wide volume at the end of the job.

Please also read our documentation on “File Systems”.

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar + optional compression) and use node-local storage that is accessible via $TMPDIR.

General information

Acknowledgement

How to acknowledge resource usage

In general, use the following formulation in publications for acknowledging the resources and the support of NHR@FAU:

  • for the FAU Tier3 resources: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). The hardware is funded by the German Research Foundation (DFG).”
  • for the NHR@FAU resources/projects: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) under the NHR project <ID of your NHR@FAU project>. NHR funding is provided by federal and Bavarian state authorities. NHR@FAU hardware is partially funded by the German Research Foundation (DFG) – 440719683.”

(Also do not forget to send a copy of your papers to nhr-redaktion@lists.fau.de!)

Alex Cluster

How can I access the new clusters Alex and Fritz?

FAU staff and students who already have an HPC account can request access to Alex here: https://hpc.fau.de/tier3-access-to-alex/ and access to Fritz here: https://hpc.fau.de/tier3-access-to-fritz/. Access is restricted to projects with extended demands that are not feasible on TinyGPU or Woody/Meggie but are still below the NHR thresholds. You have to demonstrate this and provide a short description of what you want to do there.

If you do not have an HPC account, please follow our instructions on “Getting started with HPC”.

External scientists have to submit an NHR proposal to get access.

Basic HPC Knowledge

Is it true that Arm processors are now competitive in HPC?

“Arm CPU” just means that it uses an instruction set architecture (ISA) licensed from Arm, but the hardware implementation can vary a lot. There is a plethora of Arm-based designs, some from Arm and many from other vendors. Many target the low-power/embedded market, but there are some which have entered the HPC area. Prominent examples are the Fujitsu A64FX and the Marvell ThunderX2. A TX2 system is available in the NHR test and benchmark cluster.

What is a vector computer?

A vector computer has an ISA and CPU architecture that enable efficient operations on array data. This goes under the name of Single Instruction Multiple Data (SIMD). SIMD features have proliferated in commodity CPUs as well, but a true vector CPU has features that make it more efficient, such as large vector lengths (e.g., 256 elements) and a high memory bandwidth (e.g., 1.5 Tbyte/s). Currently, only NEC offers a true vector processor, the SX-Aurora Tsubasa. A node with two Tsubasa cards is available in the NHR test and benchmark cluster.

What is SMT (also known as hyperthreading)?

Simultaneous multi-threading (SMT) allows a CPU core to run more than one software thread at the same time. These “hardware threads” a.k.a. “virtual cores” share almost all resources. The purpose of this feature is to make better use of the execution units within the core. It is rather hard to predict the benefit of SMT for real applications, so the best strategy is to try it using a well-designed, realistic benchmark case.

What is thread or process affinity?

Modern multicore systems have a strong topology, i.e., groups of hardware threads share different resources such as cores, caches, and memory interfaces. Many performance features of parallel programs depend on where their threads and processes are running in the machine. This makes it vital to bind these threads and processes to hardware threads so that variability is reduced and resources are balanced.
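For OpenMP programs, one common way to request such a binding is through the standard OpenMP environment variables, as in the sketch below. The thread count and the binary name ./my_openmp_app are assumptions for illustration; other tools and batch-system options for pinning exist and may be more appropriate on a given cluster.

```shell
#!/bin/bash
# Sketch only: ./my_openmp_app is a hypothetical binary, so its launch line stays commented out.

export OMP_NUM_THREADS=4      # assumed thread count; match it to the cores you requested
export OMP_PLACES=cores       # define one place per physical core
export OMP_PROC_BIND=close    # bind threads to neighboring places so they do not migrate

# ./my_openmp_app

echo "threads=$OMP_NUM_THREADS places=$OMP_PLACES bind=$OMP_PROC_BIND"
```

With these settings, repeated runs place the threads in the same way each time, which reduces run-to-run variability.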

Why does my program give a http/https timeout?

When software that tries to connect to the internet runs on one of the cluster nodes, you might encounter timeout errors.
By default, cluster nodes are not allowed to access the internet.
However, you can work around this by setting a proxy:
export http_proxy=http://proxy:80
export https_proxy=http://proxy:80

Cluster Access

How can I access the cluster frontends?

Almost all HPC systems at NHR@FAU use private IP addresses that can only be accessed directly from within FAU. There are three options for logging in to the clusters from outside the university:

  1. Use a VPN (Virtual Private Network) connection.
  2. Use IPv6. The cluster frontends have world-visible IPv6 addresses.
  3. Use our “dialog server” cshpc.rrze.fau.de. The dialog server is the only HPC machine with a public IPv4 address. cshpc is a Linux system that is open to all HPC accounts. From this machine, you can log into any NHR@FAU system. A more complete description can be found on our documentation pages.

Whichever option you choose, you need to use SSH. Please consult our extensive SSH documentation pages for details.

How can I access the new clusters Alex and Fritz?

FAU staff and students who already have an HPC account can request access to Alex here: https://hpc.fau.de/tier3-access-to-alex/ and access to Fritz here: https://hpc.fau.de/tier3-access-to-fritz/. Access is restricted to projects with extended demands that are not feasible on TinyGPU or Woody/Meggie but are still below the NHR thresholds. You have to demonstrate this and provide a short description of what you want to do there.

If you do not have an HPC account, please follow our instructions on “Getting started with HPC”.

External scientists have to submit an NHR proposal to get access.

How to acknowledge resource usage

In general, use the following formulation in publications for acknowledging the resources and the support of NHR@FAU:

  • for the FAU Tier3 resources: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). The hardware is funded by the German Research Foundation (DFG).”
  • for the NHR@FAU resources/projects: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) under the NHR project <ID of your NHR@FAU project>. NHR funding is provided by federal and Bavarian state authorities. NHR@FAU hardware is partially funded by the German Research Foundation (DFG) – 440719683.”

(Also do not forget to send a copy of your papers to nhr-redaktion@lists.fau.de!)

Contact

How can I contact the HPC team?

An informal and low-threshold way to talk to members of the HPC team is our regular HPC Café. The HPC Café takes place every second Tuesday of the month at 4:00 p.m. in seminar room 2.049 at RRZE, Martensstr. 1, 91058 Erlangen. Due to the Covid-19 pandemic, the HPC Café has been replaced by an online consultation hour since early 2020. Details are published on the HPC Café website.

Note: Currently we mostly work from home and thus cannot be reached via our office phone numbers. We can arrange virtual appointments by Zoom, MS Teams, or BigBlueButton. You may also call RRZE’s HelpDesk (+49-9131-85-29955) and leave a message for us, but you will probably get a faster response by sending us an e-mail (hpc-support@fau.de).

FileSystems/Data Storage

Where can I store my data?

Your home directory is accessible via $HOME. Each user gets a standard quota of 50 Gigabytes and quota extensions are not possible.

Additional storage is accessible via $HPCVAULT. Here, the default quota for each user is 500 Gigabytes.

The recommended work directory is accessible via $WORK. The standard quota for each user is 500 Gigabytes.

All three directories ($HOME, $HPCVAULT and $WORK) are available throughout our HPC systems.

We recommend that you use the aforementioned variables in your job scripts and do not rely on the specific paths, because these may change over time, e.g. when directories are relocated to a different NFS server.

Job-specific storage (either located in main memory [RAM disk] or, if available, on a local HDD/SSD) is accessible via $TMPDIR and is always node-local. Its size differs between clusters, and it is only available during the lifetime of the job. Data is flushed after the job finishes!

Some of our clusters have a local parallel filesystem for high performance short-term storage that is accessible via $FASTTMP. These filesystems are specific to the clusters and not available on other clusters. This type of storage is not suitable for programs such as MD simulations that have quite high output rates!

Please also have a look at our documentation on “File Systems”.

Fritz Cluster

How can I access the new clusters Alex and Fritz?

FAU staff and students who already have an HPC account can request access to Alex here: https://hpc.fau.de/tier3-access-to-alex/ and access to Fritz here: https://hpc.fau.de/tier3-access-to-fritz/. Access is restricted to projects with extended demands that are not feasible on TinyGPU or Woody/Meggie but are still below the NHR thresholds. You have to demonstrate this and provide a short description of what you want to do there.

If you do not have an HPC account, please follow our instructions on “Getting started with HPC”.

External scientists have to submit an NHR proposal to get access.

Login

How can I access the cluster frontends?

Almost all HPC systems at NHR@FAU use private IP addresses that can only be accessed directly from within FAU. There are three options for logging in to the clusters from outside the university:

  1. Use a VPN (Virtual Private Network) connection.
  2. Use IPv6. The cluster frontends have world-visible IPv6 addresses.
  3. Use our “dialog server” cshpc.rrze.fau.de. The dialog server is the only HPC machine with a public IPv4 address. cshpc is a Linux system that is open to all HPC accounts. From this machine, you can log into any NHR@FAU system. A more complete description can be found on our documentation pages.

Whichever option you choose, you need to use SSH. Please consult our extensive SSH documentation pages for details.

How can I change my HPC password?

If you applied for the HPC account using the paper forms: Please log in to www.idm.fau.de with your IdM account and change the HPC password under Services.

Please note that it generally takes a few hours until password changes are known on the HPC systems; the change will not happen at the same time on all clusters.

If you got your HPC account through the new HPC portal, there is no password for the HPC account at all. Authorization at the HPC portal is done using SSO (provided by your home institute) and login to the HPC systems is by SSH keys (which have to be uploaded through the HPC portal). Information on how to upload SSH keys to the HPC portal can be found in our documentation on the HPC portal.

Miscellaneous

How to acknowledge resource usage

In general, use the following formulation in publications for acknowledging the resources and the support of NHR@FAU:

  • for the FAU Tier3 resources: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). The hardware is funded by the German Research Foundation (DFG).”
  • for the NHR@FAU resources/projects: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) under the NHR project <ID of your NHR@FAU project>. NHR funding is provided by federal and Bavarian state authorities. NHR@FAU hardware is partially funded by the German Research Foundation (DFG) – 440719683.”

(Also do not forget to send a copy of your papers to nhr-redaktion@lists.fau.de!)

I heard that RISC-V is “the new thing”.

RISC-V is just a modern, open instruction set architecture (ISA) that does not have the licensing issues of Arm. However, the underlying processor architecture will mainly determine the performance of code. So far, competitive RISC-V designs are nowhere to be seen in HPC, but this may change in the future.

Is it true that Arm processors are now competitive in HPC?

“Arm CPU” just means that it uses an instruction set architecture (ISA) licensed from Arm, but the hardware implementation can vary a lot. There is a plethora of Arm-based designs, some from Arm and many from other vendors. Many target the low-power/embedded market, but there are some which have entered the HPC area. Prominent examples are the Fujitsu A64FX and the Marvell ThunderX2. A TX2 system is available in the NHR test and benchmark cluster.

What about the Apple M1?

It’s positively impressive, in terms of memory bandwidth as well as the architecture of its memory hierarchy. Current models still lack the peak performance needed to be competitive with x86 server CPUs, however.

What is a vector computer?

A vector computer has an ISA and CPU architecture that enable efficient operations on array data. This goes under the name of Single Instruction Multiple Data (SIMD). SIMD features have proliferated in commodity CPUs as well, but a true vector CPU has features that make it more efficient, such as large vector lengths (e.g., 256 elements) and a high memory bandwidth (e.g., 1.5 Tbyte/s). Currently, only NEC offers a true vector processor, the SX-Aurora Tsubasa. A node with two Tsubasa cards is available in the NHR test and benchmark cluster.

Password

How can I change my HPC password?

If you applied for the HPC account using the paper forms: Please log in to www.idm.fau.de with your IdM account and change the HPC password under Services.

Please note that it generally takes a few hours until password changes are known on the HPC systems; the change will not happen at the same time on all clusters.

If you got your HPC account through the new HPC portal, there is no password for the HPC account at all. Authorization at the HPC portal is done using SSO (provided by your home institute) and login to the HPC systems is by SSH keys (which have to be uploaded through the HPC portal). Information on how to upload SSH keys to the HPC portal can be found in our documentation on the HPC portal.

Software environment

Why does my program give a http/https timeout?

When software that tries to connect to the internet runs on one of the cluster nodes, you might encounter timeout errors.
By default, cluster nodes are not allowed to access the internet.
However, you can work around this by setting a proxy:
export http_proxy=http://proxy:80
export https_proxy=http://proxy:80

Test Cluster

Is it true that Arm processors are now competitive in HPC?

“Arm CPU” just means that it uses an instruction set architecture (ISA) licensed from Arm, but the hardware implementation can vary a lot. There is a plethora of Arm-based designs, some from Arm and many from other vendors. Many target the low-power/embedded market, but there are some which have entered the HPC area. Prominent examples are the Fujitsu A64FX and the Marvell ThunderX2. A TX2 system is available in the NHR test and benchmark cluster.

What is a vector computer?

A vector computer has an ISA and CPU architecture that enable efficient operations on array data. This goes under the name of Single Instruction Multiple Data (SIMD). SIMD features have proliferated in commodity CPUs as well, but a true vector CPU has features that make it more efficient, such as large vector lengths (e.g., 256 elements) and a high memory bandwidth (e.g., 1.5 Tbyte/s). Currently, only NEC offers a true vector processor, the SX-Aurora Tsubasa. A node with two Tsubasa cards is available in the NHR test and benchmark cluster.

What is SMT (also known as hyperthreading)?

Simultaneous multi-threading (SMT) allows a CPU core to run more than one software thread at the same time. These “hardware threads” a.k.a. “virtual cores” share almost all resources. The purpose of this feature is to make better use of the execution units within the core. It is rather hard to predict the benefit of SMT for real applications, so the best strategy is to try it using a well-designed, realistic benchmark case.

What is thread or process affinity?

Modern multicore systems have a strong topology, i.e., groups of hardware threads share different resources such as cores, caches, and memory interfaces. Many performance features of parallel programs depend on where their threads and processes are running in the machine. This makes it vital to bind these threads and processes to hardware threads so that variability is reduced and resources are balanced.

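
As a minimal sketch of what binding can look like in practice, the following shell fragment sets the standard OpenMP environment variables OMP_PLACES and OMP_PROC_BIND and, if the taskset tool from util-linux is installed, prints the hardware threads the current shell is allowed to run on. The thread count of 4 and the application name ./my_openmp_app are hypothetical. Batch systems and MPI launchers have their own binding options, which are not shown here.

```shell
# Standard OpenMP environment variables (OpenMP 4.0 and later).
export OMP_NUM_THREADS=4      # hypothetical thread count
export OMP_PLACES=cores       # one place per physical core
export OMP_PROC_BIND=close    # keep threads on neighboring places

# Print the affinity list of the current shell, if taskset is available.
if command -v taskset >/dev/null 2>&1; then
    taskset -cp $$ || true
fi

# A program started from this shell inherits the settings, e.g.:
#   ./my_openmp_app           (hypothetical application name)
```

Whether a setting such as "close" or "spread" is better depends on the application, so it is worth comparing both with a realistic benchmark case.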
How can I access the cluster frontends?

Almost all HPC systems at NHR@FAU use private IP addresses that can only be accessed directly from within the FAU. There are three options for logging in to the clusters from outside the university:

  • Use a VPN (Virtual Private Network) connection.
  • Use IPv6. The cluster frontends have world-visible IPv6 addresses.
  • Use our “dialog server” cshpc.rrze.fau.de. The dialog server is the only HPC machine with a public IPv4 address. cshpc is a Linux system that is open to all HPC accounts. From this machine, you can log into any NHR@FAU system. A more complete description can be found on our documentation pages.

Whichever option you choose, you need to use SSH. Please consult our extensive SSH documentation pages for details.

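
The dialog server option can be sketched with the ProxyJump feature of OpenSSH (the -J option, available since OpenSSH 7.3), which connects to the target through cshpc in a single command. The account name and the frontend host name below are hypothetical placeholders and not actual NHR@FAU names; the fragment only assembles and prints the command.

```shell
# Hypothetical placeholders: replace with your HPC account and the frontend you need.
HPC_USER="your_account"
FRONTEND="frontend.example"

# -J first connects to the dialog server and then jumps on to the frontend.
SSH_CMD="ssh -J ${HPC_USER}@cshpc.rrze.fau.de ${HPC_USER}@${FRONTEND}"

# Print the command; run it yourself once the placeholders are replaced.
echo "$SSH_CMD"
```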
How can I contact the HPC team?

An informal and low-threshold way to talk to members of the HPC team is our regular HPC Café. The HPC Café takes place every second Tuesday of the month at 4:00 p.m. in seminar room 2.049 at RRZE, Martensstr. 1, 91058 Erlangen. Due to the Covid-19 pandemic, the HPC Café has been replaced by an online consultation hour since early 2020. Details are published on the HPC Café website.

Note: Currently we mostly work from home and thus cannot be reached via our office phone numbers. We can arrange virtual appointments by Zoom, MS Teams, or BigBlueButton. You may also call RRZE’s HelpDesk (+49-9131-85-29955) and leave a message for us, but you will probably get a faster response by sending us an e-mail (hpc-support@fau.de).

Where can I store my data?

  • Your home directory is accessible via $HOME. Each user gets a standard quota of 50 Gigabytes; quota extensions are not possible.
  • Additional storage is accessible via $HPCVAULT. Here, the default quota for each user is 500 Gigabytes.
  • The recommended work directory is accessible via $WORK. The standard quota for each user is 500 Gigabytes.

All three directories ($HOME, $HPCVAULT and $WORK) are available throughout our HPC systems. We recommend that you use these variables in your job scripts rather than relying on specific paths, as the paths may change over time, e.g., when directories are relocated to a different NFS server.

  • Job-specific storage (located either in main memory [RAM disk] or, if available, on a local HDD/SSD) is accessible via $TMPDIR and is always node-local. Its size differs between clusters, and it is only available during the lifetime of the job. Data is flushed after the job finishes!
  • Some of our clusters have a local parallel filesystem for high performance short-term storage that is accessible via $FASTTMP. These filesystems are specific to the clusters and not available on other clusters. This type of storage is not suitable for programs such as MD simulations that have quite high output rates!

Please also have a look at our documentation on “File Systems”.

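
A minimal sketch of how these variables might be combined in a job script: intermediate data is written to node-local $TMPDIR, and the results are copied to $WORK before the job ends. The directory name results_example and the file output.dat are hypothetical, and the fallback values are only there so that the fragment also runs outside of a job.

```shell
# Fallbacks for running outside a job; on the clusters these variables are preset.
TMPDIR="${TMPDIR:-/tmp}"
WORK="${WORK:-$TMPDIR/work_example}"

RESULTS="$WORK/results_example"                 # hypothetical result directory
SCRATCH="$(mktemp -d "$TMPDIR/job.XXXXXX")"     # private node-local scratch directory

# Stand-in for the real application writing its output to node-local storage.
echo "intermediate data" > "$SCRATCH/output.dat"

# Copy results back before the job finishes, because $TMPDIR is flushed afterwards.
mkdir -p "$RESULTS"
cp "$SCRATCH/output.dat" "$RESULTS/"
rm -rf "$SCRATCH"
```

Referring to $WORK and $TMPDIR instead of absolute paths keeps such a script working if the underlying directories are relocated.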
I heard that RISC-V is “the new thing”.

RISC-V is just a modern, open instruction set architecture (ISA) that does not have the licensing issues of Arm. However, the underlying processor architecture will mainly determine the performance of code. So far, competitive RISC-V designs are nowhere to be seen in HPC, but this may change in the future.

What about the Apple M1?

It is impressive in terms of both its memory bandwidth and the architecture of its memory hierarchy. However, current models still lack the peak performance needed to be competitive with x86 server CPUs.