FAQ - NHR@FAU

Categories: Access | Usage | Software | Hardware | General information

Access

Alex Cluster

How can I attach to a running Slurm job

To attach to a running Slurm job, use srun --pty --overlap --jobid YOUR-JOBID bash. This will give you a shell on the first node of your job and you can run top, nvidia-smi, etc. to check your job. This is an alternative to SSH-ing into your node. Using srun to attach to a job is the only way to see the correct GPU if you have multiple GPU jobs running on a single node as SSH will always get you into last modified cgroup which might not be the job / GPUs you are looking for.

Basic HPC Knowledge

Why does my program give a http/https timeout?

When running software, which tries to connect to the internet, on one of the compute nodes of our cluster you might encounter time-out errors.
By default we do not allow compute nodes of the clusters to access the internet. Only the login nodes do NAT.
However, you can circumvent this for HTTP/HTTPS connections by setting a proxy:
export http_proxy=http://proxy:80
export https_proxy=http://proxy:80

Some software may expect instead

export HTTP_PROXY=http://proxy:80
export HTTPS_PROXY=http://proxy:80

Batch System

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

Using srun to attach to a job is the only way to see the correct GPU if you have multiple GPU jobs running on a single node as SSH will always get you into last modified cgroup which might not be the job / GPUs you are looking for.

Cluster Access

I would like to cross use data between HPC accounts

Please be careful when dealing with permissions!

see: man nfs4_acl/nfs4_getfacl/nfs4_setfacl/chmod

This might be the case if you own a Tier2 and a Tier3 account or to share data with a co-worker (group member).

In the following example USER1 will grand access for USER2:

It can be done in a two step process, open your folder using nfs4acl
USER1@frontend: nfs4_setfacl -a A::USER2@rrze.uni-erlangen.de:rwaDxtcy $WOODYHOME
replace USER with your Tier2 or Tier3 account name.

You can add an entire group with
nfs4_setfacl -a A:g:GROUP@rrze.uni-erlangen.de:RX FOLDER

As a quick&dirty alternativ you may use USER1@frontend: chmod o+x $WOODYHOME
Please note this will open the door for EVERY HPC user on the system. They will be able to access your data if they know file/folder names.

Use POSIX permissions for child folder/files, you need to set permissions for others “o”
USER1@frontend: chmod --recursive o+rx $WOODYHOME/testing
or user of the same group “g”
USER1@frontend: chmod --recursive g+rx $WOODYHOME/testing

You may have to modify umask to ensure that new files are created with proper permissions.

If everything worked permissions should look like this:
USER1@frontend: nfs4_getfacl $WOODYHOME
# file: /home/woody/$USER/$GROUP
A::OWNER@:rwaDxtTcCy
A::USER2@rrze.uni-erlangen.de:rwaDxtcy <-- please note UID are sometimes not resolved here, use "id UID" to check
A::GROUP@:rxtcy
A::EVERYONE@:tcy

USER1@frontend: ls -alh /home/woody/$GROUP/$USER/testing
-rwxr-xr-x 1 USER1 GROUP 115 Jan 11 14:24 testing

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

I cannot find my HPC account in the HPC portal

If you cannot find your HPC account in the HPC-portal, it is likely that your account is still managed via the “HPC paper form” (or the simplified process for lectures) and these types of accounts are not visible in the HPC-portal but the IdM portal (https://idm.fau.de/).

Any HPC account which still has a password is still managed through the IdM portal. Once these accounts are migrated to the HPC-portal, you’ll be informed and the HPC account looses its password. Migration will take place in waves, group by group.

I received an invite mail from the HPC-Portal but there is no account data visible

Please check the email address that is transmitted via SSO.

Login to the portal and click on your SSO name in the upper right corner. Go to “Profile” and check the email address that is visible on the left side below “Personal data”.

We can only match invitations that have been sent to the email address shown in your profile. If you see an empty account, please ask for the invitation to be resend to the correct email address.

ClusterCockpit

How can I access a link to monitoring.nhr.fau.de?

For HPC portal users (i.e., who have accounts without a password), the job-specific monitoring of ClusterCockpit is only accessible via the HPC portal.

Please login to HPC portal first and then follow the link to ClusterCockpit to have a valid ClusterCockpit session; unless you have already done so recently. Sessions are valid for several hours/days.

FileSystems/Data Storage

I would like to cross use data between HPC accounts

Please be careful when dealing with permissions!

see: man nfs4_acl/nfs4_getfacl/nfs4_setfacl/chmod

This might be the case if you own a Tier2 and a Tier3 account or to share data with a co-worker (group member).

In the following example USER1 will grand access for USER2:

You can add an entire group with
nfs4_setfacl -a A:g:GROUP@rrze.uni-erlangen.de:RX FOLDER

You may have to modify umask to ensure that new files are created with proper permissions.

USER1@frontend: ls -alh /home/woody/$GROUP/$USER/testing
-rwxr-xr-x 1 USER1 GROUP 115 Jan 11 14:24 testing

GPU usage

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

HPC-Portal

How can I access a link to monitoring.nhr.fau.de?

For HPC portal users (i.e., who have accounts without a password), the job-specific monitoring of ClusterCockpit is only accessible via the HPC portal.

I cannot find my HPC account in the HPC portal

I have a Tier3-project in the HPC portal. Which of the project categories is the correct one for the new account I want to add?

There are three default Tier3-project categories in the HPC portal:

###100: Tier3 Grundversorgung¹ <name of institute/department> (Prof. <name>)—that’s for staff, PhD students with a contract at FAU (“contract” does not necessarily mean “employment”), employed student assistants, etc.
###101: Studentische Abschlußarbeiten² Tier3 Grundversorgung¹ <name of institute/department> (Prof. <name>)—that’s for students doing their bachelor/master thesis or some sort of study-related project work (i.e., coursework) but usually not for tutorials of a lecture (that would be a separate category⁴); doctoral theses do not fall into this category (that would be basic supply ###100)
###102: Projektpartner³ Tier3 Grundversorgung¹ <name of institute/department> (Prof. <name>)—that’s for people not belonging to FAU, i.e., all people who should have access to HPC resources as part of your Tier3 HPC basic provision but who are neither students nor employees/scholarship holders/etc. at FAU; guest researchers or former employees, for example, fall into this category

If you wish to change the project category of an HPC identifier within your group, please inform us by sending an informal email to hpc-support@fau.de. The different project categories are currently used for statistical purposes only. In the future, however, they will also be used to control the frequency of re-validation (=forced SSO login in the HPC portal), for example.

[1] Basic supply [2] Student theses [3] Project partners
[4] In some cases, there is ###103 for teaching activities but usually, there will be a dedicated project with a separate name space.

I received an invite mail from the HPC-Portal but there is no account data visible

Please check the email address that is transmitted via SSO.

Login to the portal and click on your SSO name in the upper right corner. Go to “Profile” and check the email address that is visible on the left side below “Personal data”.

We can only match invitations that have been sent to the email address shown in your profile. If you see an empty account, please ask for the invitation to be resend to the correct email address.

Miscellaneous

I would like to cross use data between HPC accounts

Please be careful when dealing with permissions!

see: man nfs4_acl/nfs4_getfacl/nfs4_setfacl/chmod

This might be the case if you own a Tier2 and a Tier3 account or to share data with a co-worker (group member).

In the following example USER1 will grand access for USER2:

You can add an entire group with
nfs4_setfacl -a A:g:GROUP@rrze.uni-erlangen.de:RX FOLDER

You may have to modify umask to ensure that new files are created with proper permissions.

USER1@frontend: ls -alh /home/woody/$GROUP/$USER/testing
-rwxr-xr-x 1 USER1 GROUP 115 Jan 11 14:24 testing

I cannot find my HPC account in the HPC portal

I received an invite mail from the HPC-Portal but there is no account data visible

Please check the email address that is transmitted via SSO.

Login to the portal and click on your SSO name in the upper right corner. Go to “Profile” and check the email address that is visible on the left side below “Personal data”.

We can only match invitations that have been sent to the email address shown in your profile. If you see an empty account, please ask for the invitation to be resend to the correct email address.

Password

SSH is asking for a password

If you got your HPC account through the new HPC portal (https://portal.hpc.fau.de) because it is for example an NHR project, there is NO password for such an HPC account. You log into the HPC portal using your SSO credentials (of your university). Access to the HPC systems with your HPC account created through the portal is by SSH keys only. The SSH public key is uploaded to the HPC portal and it will take a couple of hours until all HPC systems know a new/changed SSH public key. Multiple SSH public keys can be uploaded.

What is the password of my HPC account?

A) If you got your HPC account through the new HPC portal (https://portal.hpc.fau.de) because it is for example an NHR project, there is NO password for such an HPC account. You log into the HPC portal using your SSO credentials (of your university). Access to the HPC systems with your HPC account created through the portal is by SSH keys only. The SSH public key is uploaded to the HPC portal and it will take a couple of hours until all HPC systems know a new/changed SSH public key. Multiple SSH public keys can be uploaded.

B) If you have an FAU IdM account and applied for your HPC account with the HPC paper form, you can set a dedicated password for the HPC account through the IdM portal (https://idm.fau.de). It will take a couple of hours until all HPC systems know the changed password.

Python/Conda

Why does my program give a http/https timeout?

Some software may expect instead

export HTTP_PROXY=http://proxy:80
export HTTPS_PROXY=http://proxy:80

SSH

Debugging SSH problems

The get more information on SSH problems, add the “-v” option to SSH. This will give moderate debug information, e.g. show which SSH keys are tried.

Here is a sample output

max@notebook:~$ ssh -v cshpc
OpenSSH_7.6p1 Ubuntu-4ubuntu0.7, OpenSSL 1.0.2n 7 Dec 2017
debug1: Reading configuration data /home/max/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: Connecting to cshpc.rrze.uni-erlangen.de [131.188.3.39] port 22.
debug1: Connection established.
debug1: identity file /home/max/.ssh/id_rsa type 0
debug1: key_load_public: No such file or directory
debug1: identity file /home/max/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: Local version string SSH-2.0-OpenSSH_7.6p1 Ubuntu-4ubuntu0.7
debug1: Remote protocol version 2.0, remote software version OpenSSH_8.2p1 Ubuntu-4ubuntu0.5
debug1: match: OpenSSH_8.2p1 Ubuntu-4ubuntu0.5 pat OpenSSH* compat 0x04000000
debug1: Authenticating to cshpc.rrze.uni-erlangen.de:22 as 'unrz143'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ecdsa-sha2-nistp256 SHA256:wFaDywle3yJvygQ4ZAPDsi/iSBTaF6Uoo0i0z727aJU
debug1: Host 'cshpc.rrze.uni-erlangen.de' is known and matches the ECDSA host key.
debug1: Found key in /home/max/.ssh/known_hosts:215
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<ssh-ed25519,sk-ssh-ed25519@openssh.com,ssh-rsa,rsa-sha2-256,rsa-sha2-512,ssh-dss,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,sk-ecdsa-sha2-nistp256@openssh.com>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,password
debug1: Next authentication method: publickey
debug1: Offering public key: RSA SHA256:mWO4eYar1/JYn8MDB0DPer+ibB/QatmhxvvngfaoMgQ /home/max/.ssh/id_rsa
debug1: Server accepts key: pkalg rsa-sha2-512 blen 277
debug1: Authentication succeeded (publickey).
Authenticated to cshpc.rrze.uni-erlangen.de ([131.188.3.39]:22).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: pledge: network
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
debug1: Sending environment.
debug1: Sending env LC_ALL = en_US.UTF-8
debug1: Sending env LANG = en_US.UTF-8

To check the fingerprint of your SSH key, use

max@notebook:~$ ssh-keygen -l -f ~/.ssh/id_rsa
2048 SHA256:mWO4eYar1/JYn8MDB0DPer+ibB/QatmhxvvngfaoMgQ max@notebook (RSA)

This fingerprint must also match the data shown in the HPC portal (if your SSH keys are managed by the HPC portal).

In the debug output I find the following

debug1: Authentications that can continue: publickey,password
debug1: Next authentication method: publickey
debug1: Offering public key: /home/max/.ssh/id_rsa RSA SHA256:xCyJUQcsJldPWfZXSasoI0ZCoteKWHw1e95ylm2HK1g agent
debug1: Server accepts key: /home/max/.ssh/id_rsa RSA SHA256:xCyJUQcsJldPWfZXSasoI0ZCoteKWHw1e95ylm2HK1g agent
sign_and_send_pubkey: signing failed for RSA "/home/max/.ssh/id_rsa" from agent: agent refused operation
debug1: Next authentication method: password

This message is miss leading. “sign_and_send_pubkey: signing failed … agent refused operation” typically means that you entered a wrong passphrase for the SSH key.

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

I managed to log in to cshpc (with an SSH key) but get asked for a password / permission denied when continuing to a cluster frontend

The explanation is rather simple: the dialog server cshpc does not know any SSH (private) key from you, thus, fails to do SSH key-based authentication when connecting to one of the cluster frontends and, thus, tries password authentication as fallback.

There are a couple of solutions to mitigate that:

Use the “jump host”/”proxy jump” feature of SSH and directly connect to the cluster frontends through the dialog server cshpc. To do this, either use the command line option “-j” of recent SSH versions or use an ~/.ssh/config file on your local computer. See https://hpc.fau.de/systems-services/documentation-instructions/ssh-secure-shell-access-to-hpc-systems/#ssh_config_hpc_portal for templates.
Create an additional SSH key pair on cshpc and add the corresponding SSH public key to the HPC portal (if your account is already managed trough the HPC portal) – or add it to ~/.ssh/authorized_keys (which will only be a temporary solution until all HPC accounts are migrated to the HPC portal)
Use an SSH agent on your local computer and allow it to forward its connection to our dialog server cshpc.

All there ways make sure that cshpc has a SSH private key available when connecting to the cluster frontends.

My HPC account just has been created but I cannot login or Slurm rejects my jobs

Home directories and entries to the Slurm data base are only done once per day (in the late morning). Thus, be patent and wait for the next day at 10 o’clock if your HPC account has just been (re)created.

As with SSH key updates (and password updates for legacy accounts), the processes run on different servers at different times. Thus, before the next day, some of your directories or services may have already been created while others aren’t.

My just updated SSH key (from the HPC portal) or password (from the IdM passwort) is not accepted

It always takes a couple of hours for updated SSH keys to be propagated to all HPC systems. As the clusters are synchronized at different points in time, it may happen that one system already knows the update while others don’t. It typically takes 2-4 hours for an updated to be propagated to all systems.

The same is true for the propagation of HPC password for accounts created through the FAU IdM portal using paper applications.

SSH is asking for a password

What is the password of my HPC account?

Slurm

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

My HPC account just has been created but I cannot login or Slurm rejects my jobs

Slurm options get ignored when given as sbatch command line arguments

I give some Slurm options as command line arguments to sbatch, but they are ignored!?

The syntax of sbatch is: sbatch [OPTIONS(0)...] [ : [OPTIONS(N)...]] script(0) [args(0)...]

Thus, options for sbatch have to be given before the batch script. Arguments given after the batch script are used as arguments for the batch script and not for sbatch.

Software environment

Why does my program give a http/https timeout?

Some software may expect instead

export HTTP_PROXY=http://proxy:80
export HTTPS_PROXY=http://proxy:80

TinyGPU Cluster

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

Usage

Acknowledgement

How to acknowledge resource usage

In general, use the following formulation in publications for acknowledging the resources and the support of NHR@FAU:

for the FAU Tier3 resources: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). The hardware is funded by the German Research Foundation (DFG).”
for the NHR@FAU resources/projects: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) under the NHR project <ID of your NHR@FAU project>. NHR funding is provided by federal and Bavarian state authorities. NHR@FAU hardware is partially funded by the German Research Foundation (DFG) – 440719683.”

(Also do not forget to send a copy of your papers to nhr-redaktion@lists.fau.de!)

Alex Cluster

How can I access the new clusters Alex and Fritz?

FAU staff and students who already have an HPC account can request access to Alex here: https://hpc.fau.de/tier3-access-to-alex/ and access to Fritz here: https://hpc.fau.de/tier3-access-to-fritz/. Access is restricted to projects with extended demands, thus, not feasible on TinyGPU or Woody/Meggie, but still below the NHR thresholds. You have to prove that and provide a short description of what you want to do there.

If you do not have an HPC account, please follow our instructions on “Getting started with HPC“.

External scientists have to submit a NHR proposal to get access.

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

How can I request an interactive job on Alex?

Interactive jobs can be requested by using salloc and specifying the respective options on the command line.

The following will give you an interactive shell on one of the A40 nodes for one hour:
salloc --gres=gpu:a40:1 --partition=a40 --time=01:00:00

Note that settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

This and more information can be found in our documentation on Alex.

How can I run my job on the cluster?

To submit a job to one of our cluster, you first have to login to a cluster frontend. The compute nodes are not directly accessible and we have a batch system running that handles the queuing of jobs into different partitions (depending on the needed resources, e.g. runtime) and sorting according to some priority scheme. A job will run when the required resources become available.

Please do not run your jobs on the cluster frontends!

The login nodes are not suitable for computational work, since they are shared among all users. We do not allow MPI-parallel applications on the frontends and short parallel test runs must be performed using batch jobs. It is possible to submit interactive batch jobs that, when started, open a shell on one of the assigned compute nodes and let you run interactive programs there.

Please consult our documentation for details about Batch Processing.

We also provide general job script examples for parallel jobs and GPU jobs; however, we have also prepared more specific job scripts for applications that our users frequently run on the clusters.

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar + optional compression) and use node-local storage that is accessible via $TMPDIR.

The software I need is not installed. What can I do now?

On all HPC systems, established tools for software development (compilers, editors, …), libraries, and selected applications are available. For many of these applications, it is necessary to set special environment variables, so that e.g. search paths are correct or license servers can be found.

To ease selection of and switching between different versions of software packages, all HPC systems use the modules system (cf. modules.sourceforge.net). It allows to conveniently load the necessary configurations for different programs or different versions of the same program an, if necessary, unload them again later.

For information on how to use the modules system, please have a look into the respective section in our documentation about “Software environment“.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack“ as enhanced HPC package manager.

Feel free to compile software in the versions and with the options you need yourself. This is perfectly fine, yet support for self-installed software cannot be granted. We only can provide software centrally which is of importance for multiple groups. If you want to use Spack for compiling additional software, you can load our user-spack module to make use of the packages we already build with Spack if the concretization matches instead of starting from scratch. Once user-spack is loaded, the command spack will be available (as alias), you will inherit the pre-sets we defined for certain packages (e.g. Open MPI to work with Slurm), but you’ll install everything into your own directories ($WORK/USER-SPACK).

You can also bring your own environment in a container using Singularity. However, building Singularity containers on the HPC systems themselves is not supported (as that would require root access). The Infiniband drivers from the host are not mounted into your container. All filesystems will also be available by default in the container. In certain use cases it might be a good idea to avoid bind-mounting your normal $HOME directory with all its “dot directories” into the container by explicitly specifying a different directory, e.g. -H $HOME/my-container-home.

Why is my code not using the GPU on Alex?

CUDA is not installed as part of the OS – you have to load a cuda module for your binaries to find libcublas, libcudnn, etc.

Please also have a look at our documentation “Working with NVIDIA GPUs“.

Why is my pytorch/tensorflow using CPU only?

The most common mistake is to build the conda/python-venv environment on a fontend and not on a cluster node with GPU.

You can start an interactive job on Alex with salloc --gres=gpu:a40:1 --partition=a40 --time=02:00:00
and on TinyGPU with salloc.tinygpu --gres=gpu:1 --time=02:00:00

Basic HPC Knowledge

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

What is a parallel file system?

In a parallel file system (PFS), data is distributed not only across several disks but also multiple servers in order to increase the data access bandwidth. Most PFS’s are connected to the high-speed network of a cluster, and aggregated bandwidths in the TByte/s range are not uncommon. High bandwidth can, however, only be obtained with large files and streaming access.

For information on how to use a parallel file system on our clusters, please read our documentation on “Parallel file systems $FASTTMP“.

What is SMT (also known as hyperthreading)?

Simultaneous multi-threading (SMT) allows a CPU core to run more than one software thread at the same time. These “hardware threads” a.k.a. “virtual cores” share almost all resources. The purpose of this feature is to make better use of the execution units within the core. It is rather hard to predict the benefit of SMT for real applications, so the best strategy is to try it using a well-designed, realistic benchmark case.

What is thread or process affinity?

Modern multicore systems have a strong topology, i.e., groups of hardware threads share different resources such as cores, caches, and memory interfaces. Many performance features of parallel programs depend on where their threads and processes are running in the machine. This makes it vital to bind these threads and processes to hardware threads so that variability is reduced and resources are balanced.

Why does my program give a http/https timeout?

Some software may expect instead

export HTTP_PROXY=http://proxy:80
export HTTPS_PROXY=http://proxy:80

Why is my pytorch/tensorflow using CPU only?

The most common mistake is to build the conda/python-venv environment on a fontend and not on a cluster node with GPU.

You can start an interactive job on Alex with salloc --gres=gpu:a40:1 --partition=a40 --time=02:00:00
and on TinyGPU with salloc.tinygpu --gres=gpu:1 --time=02:00:00

Why should I care about file systems?

Not only may efficient file operations speed up your own code (if file I/O is what you must do); they will also reduce the burden on shared file servers and thus leave more performance headroom for other users of the resource. Hence, it is a matter of thoughtfulness to optimize file accesses even if your performance gain is marginal.

Batch System

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

How can I request an interactive job on Alex?

Interactive jobs can be requested by using salloc and specifying the respective options on the command line.

The following will give you an interactive shell on one of the A40 nodes for one hour:
salloc --gres=gpu:a40:1 --partition=a40 --time=01:00:00

Note that settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

This and more information can be found in our documentation on Alex.

How can I request an interactive job on Fritz?

Interactive jobs can be requested by using salloc and specifying the respective options on the command line.

The following will give you an interactive shell on one node for one hour:
salloc -N 1 --partition=singlenode --time=01:00:00

The following will give you four nodes with an interactive shell on the first node for one hour:
salloc -N 4 --partition=multinode --time=01:00:00

Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

This and more information can be found in our documentation on Fritz.

How can I request an interactive job on TinyGPU?

Interactive Slurm Shell (RTX2080Ti, RTX3080, V100 and A100 nodes only)

To generate an interactive Slurm shell on one of the compute nodes, the following command has to be issued on the woody frontend:
salloc.tinygpu --gres=gpu:1 --time=00:30:00

This will give you an interactive shell for 30 minutes on one of the nodes, allocating 1 GPU and the respective number of CPU cores. There, you can then for example compile your code or do test runs of your binary. For MPI-parallel binaries, use sruninstead of mpirun.

Please note that sallocautomatically exports the environment of your shell on the login node to your interactive job. This can cause problems if you have loaded any modules due to the version differences between the woody frontend and the TinyGPU compute nodes. To mitigate this, purge all loaded modules via module purge before issuing the salloc command.

This and more information can be found in our documentation about TinyGPU.

How can I request an interactive job on Woody-NG?

Interactive jobs can be requested by using salloc and specifying the respective options on the command line.

The following will give you an interactive shell on one node with one core dedicated to you for one hour:
salloc -n 1 --time=01:00:00

Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

This and more information can be found in our documentation about Woody-NG.

How can I run my job on the cluster?

Please do not run your jobs on the cluster frontends!

Please consult our documentation for details about Batch Processing.

We also provide general job script examples for parallel jobs and GPU jobs; however, we have also prepared more specific job scripts for applications that our users frequently run on the clusters.

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

CUDA

Why is my code not running on the GPU in TinyGPU?

Do not try to build a GPU-enabled Tensorflow, pytorch, … on the Woody login nodes. That will fail as the Woody login nodes do not have Nvidia software installed (moreover, most TinyGPU nodes run a newer OS version).

Please have look at our documentation about “Working with NVIDIA GPUs“.

Why is my code not using the GPU on Alex?

CUDA is not installed as part of the OS – you have to load a cuda module for your binaries to find libcublas, libcudnn, etc.

Please also have a look at our documentation “Working with NVIDIA GPUs“.

Why is my pytorch/tensorflow using CPU only?

The most common mistake is to build the conda/python-venv environment on a fontend and not on a cluster node with GPU.

You can start an interactive job on Alex with salloc --gres=gpu:a40:1 --partition=a40 --time=02:00:00
and on TinyGPU with salloc.tinygpu --gres=gpu:1 --time=02:00:00

Cluster Access

I would like to cross use data between HPC accounts

Please be careful when dealing with permissions!

see: man nfs4_acl/nfs4_getfacl/nfs4_setfacl/chmod

This might be the case if you own a Tier2 and a Tier3 account or to share data with a co-worker (group member).

In the following example USER1 will grand access for USER2:

You can add an entire group with
nfs4_setfacl -a A:g:GROUP@rrze.uni-erlangen.de:RX FOLDER

You may have to modify umask to ensure that new files are created with proper permissions.

USER1@frontend: ls -alh /home/woody/$GROUP/$USER/testing
-rwxr-xr-x 1 USER1 GROUP 115 Jan 11 14:24 testing

How can I access the cluster frontends?

Almost all HPC systems at NHR@FAU use private IP addresses that can only be accessed directly from within the FAU.There are three options for logging in to the clusters from outside the university:

Use a VPN (Virtual Private Network) connection.
Use IPv6. The cluster frontends have world-visible IPv6 addresses.
Use our “dialog server” cshpc.rrze.fau.de. The dialog server is the only HPC machine with a public IPv4 address. cshpc is a Linux system that is open to all HPC accounts. From this machine, you can log into any NHR@FAU system. A more complete description can be found on our documentation pages.

Whichever option you choose, you need to use SSH. Please consult our extensive SSH documentation pages for details.

How can I access the new clusters Alex and Fritz?

If you do not have an HPC account, please follow our instructions on “Getting started with HPC“.

External scientists have to submit a NHR proposal to get access.

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

How can I get access to HPC systems?

Getting an HPC account

Depending on the status, there are different protocols to get an HPC account:

NHR users from outside FAU; See the page on NHR application rules for up-to-date information on allocating resources of NHR@FAU.
Also check the pages on the NHR@FAU HPC-Portal Usage /New digital workflow for HPC accounts.
FAU staff and students (except for lectures): use the HPC application form. Details on how to fill the form are given below. Basic usage of the HPC systems typically is free of charge for FAU researchers for publicly funded research. For compute needs beyond the free basic usage see the page on NHR application rules for preliminary information on allocating resources of NHR@FAU.
Lectures of FAU with need for HPC access: there is a simplified protocol to get HPC accounts for all students of your course. Lecturer have to approach HPC support with a list of the IdM accounts of all students of the course and the course name. Late registrations of additional students are not possible. Thus, be sure to collect all IdM accounts before sending the list to RRZE.
Block courses with external participants: Lecturer have to approach HPC support at least one week in advance with title and date of the course, and the expected number of participants. Such accounts cannot be valid for more than one week.

The HPC application form for FAU within HPC4FAU

You can get the application form here: HPC application form. Applications always have to be approved by your local chair / institute – we do not serve private persons. If you have any questions regarding the application, please contact your local IT contact person at the chair / institute.

You need to fill out the application form, print it, sign it and let it be stamped with the Chair or Institute seal.

Once it is ready, you can bring it by the RRZE Service Desk or send it via Email, or internal mail.

Please visit our documentation about Getting started with HPC for more information.

How can I run my job on the cluster?

Please do not run your jobs on the cluster frontends!

Please consult our documentation for details about Batch Processing.

We also provide general job script examples for parallel jobs and GPU jobs; however, we have also prepared more specific job scripts for applications that our users frequently run on the clusters.

How to acknowledge resource usage

In general, use the following formulation in publications for acknowledging the resources and the support of NHR@FAU:

for the FAU Tier3 resources: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). The hardware is funded by the German Research Foundation (DFG).”
for the NHR@FAU resources/projects: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) under the NHR project <ID of your NHR@FAU project>. NHR funding is provided by federal and Bavarian state authorities. NHR@FAU hardware is partially funded by the German Research Foundation (DFG) – 440719683.”

(Also do not forget to send a copy of your papers to nhr-redaktion@lists.fau.de!)

I cannot find my HPC account in the HPC portal

I received an invite mail from the HPC-Portal but there is no account data visible

Please check the email address that is transmitted via SSO.

Login to the portal and click on your SSO name in the upper right corner. Go to “Profile” and check the email address that is visible on the left side below “Personal data”.

We can only match invitations that have been sent to the email address shown in your profile. If you see an empty account, please ask for the invitation to be resend to the correct email address.

The software I need is not installed. What can I do now?

For information on how to use the modules system, please have a look into the respective section in our documentation about “Software environment“.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack“ as enhanced HPC package manager.

ClusterCockpit

How can I access a link to monitoring.nhr.fau.de?

For HPC portal users (i.e., who have accounts without a password), the job-specific monitoring of ClusterCockpit is only accessible via the HPC portal.

FileSystems/Data Storage

I would like to cross use data between HPC accounts

Please be careful when dealing with permissions!

see: man nfs4_acl/nfs4_getfacl/nfs4_setfacl/chmod

This might be the case if you own a Tier2 and a Tier3 account or to share data with a co-worker (group member).

In the following example USER1 will grand access for USER2:

You can add an entire group with
nfs4_setfacl -a A:g:GROUP@rrze.uni-erlangen.de:RX FOLDER

You may have to modify umask to ensure that new files are created with proper permissions.

USER1@frontend: ls -alh /home/woody/$GROUP/$USER/testing
-rwxr-xr-x 1 USER1 GROUP 115 Jan 11 14:24 testing

How can I leverage node-local storage on TinyGPU to increase job performance?

Each node has at least 880 GB of local SSD capacity for temporary files under $TMPDIR.

The directory $TMPDIR will be deleted automatically as soon as the user has no jobs running on the node any more.

Data to be kept can be copied to a cluster-wide volume at the end of the job.

Please also read our documentation on “File Systems“.

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

What is a parallel file system?

For information on how to use a parallel file system on our clusters, please read our documentation on “Parallel file systems $FASTTMP“.

Where can I store my data?

Your home directory is accessible via $HOME. Each user gets a standard quota of 50 Gigabytes and quota extensions are not possible.

Additional storage is accessible via $HPCVAULT. Here, the default quota for each user is 500 Gigabytes.

The recommended work directory is accessible via $WORK. The standard quota for each user is 500 Gigabytes.

All three directories ($HOME, $HPCVAULT and $WORK) are available throughout our HPC systems.

We recommend you use the aforementioned variables in your jobscripts and not rely on the specific paths as this may change over time, i.e. when directories are relocated to a different NFS server.

Job-specific storage (either located in main memory [RAM disk] or, if available, local HDD / SDD) is accessible via $TMPDIR and always node-local. Size differs between clusters and is only available during job lifetime. Data is flushed after the job finishes!

Some of our clusters have a local parallel filesystem for high performance short-term storage that is accessible via $FASTTMP. These filesystems are specific to the clusters and not available on other clusters. This type of storage is not suitable for programs such as MD simulations that have quite high output rates!

Please also have a look into our documentation on “File Systems“.

Why should I care about file systems?

Why the need for several file systems?

Different file systems have different features; for example, a central NFS server has massive bytes for the buck but limited data bandwidth, while a parallel file system is much faster but smaller and usually available to one cluster only. A node-local SSD, one the other hand, has the advantage of very low latency but it cannot be accessed from outside a compute node.

For further information, please consult our documentation on “File systems“.

Fritz Cluster

How can I access the new clusters Alex and Fritz?

If you do not have an HPC account, please follow our instructions on “Getting started with HPC“.

External scientists have to submit a NHR proposal to get access.

How can I request an interactive job on Fritz?

Interactive jobs can be requested by using salloc and specifying the respective options on the command line.

The following will give you an interactive shell on one node for one hour:
salloc -N 1 --partition=singlenode --time=01:00:00

The following will give you four nodes with an interactive shell on the first node for one hour:
salloc -N 4 --partition=multinode --time=01:00:00

Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

This and more information can be found in our documentation on Fritz.

How can I run my job on the cluster?

Please do not run your jobs on the cluster frontends!

Please consult our documentation for details about Batch Processing.

We also provide general job script examples for parallel jobs and GPU jobs; however, we have also prepared more specific job scripts for applications that our users frequently run on the clusters.

The software I need is not installed. What can I do now?

For information on how to use the modules system, please have a look into the respective section in our documentation about “Software environment“.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack“ as enhanced HPC package manager.

GPU usage

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

How can I run my job on the cluster?

Please do not run your jobs on the cluster frontends!

Please consult our documentation for details about Batch Processing.

We also provide general job script examples for parallel jobs and GPU jobs; however, we have also prepared more specific job scripts for applications that our users frequently run on the clusters.

Why is my code not running on the GPU in TinyGPU?

Please have look at our documentation about “Working with NVIDIA GPUs“.

Why is my code not using the GPU on Alex?

CUDA is not installed as part of the OS – you have to load a cuda module for your binaries to find libcublas, libcudnn, etc.

Please also have a look at our documentation “Working with NVIDIA GPUs“.

Why is my pytorch/tensorflow using CPU only?

The most common mistake is to build the conda/python-venv environment on a fontend and not on a cluster node with GPU.

You can start an interactive job on Alex with salloc --gres=gpu:a40:1 --partition=a40 --time=02:00:00
and on TinyGPU with salloc.tinygpu --gres=gpu:1 --time=02:00:00

HPC-Portal

How can I access a link to monitoring.nhr.fau.de?

For HPC portal users (i.e., who have accounts without a password), the job-specific monitoring of ClusterCockpit is only accessible via the HPC portal.

I cannot find my HPC account in the HPC portal

I received an invite mail from the HPC-Portal but there is no account data visible

Please check the email address that is transmitted via SSO.

Login to the portal and click on your SSO name in the upper right corner. Go to “Profile” and check the email address that is visible on the left side below “Personal data”.

We can only match invitations that have been sent to the email address shown in your profile. If you see an empty account, please ask for the invitation to be resend to the correct email address.

Login

How can I access the cluster frontends?

Almost all HPC systems at NHR@FAU use private IP addresses that can only be accessed directly from within the FAU.There are three options for logging in to the clusters from outside the university:

Use a VPN (Virtual Private Network) connection.
Use IPv6. The cluster frontends have world-visible IPv6 addresses.
Use our “dialog server” cshpc.rrze.fau.de. The dialog server is the only HPC machine with a public IPv4 address. cshpc is a Linux system that is open to all HPC accounts. From this machine, you can log into any NHR@FAU system. A more complete description can be found on our documentation pages.

Whichever option you choose, you need to use SSH. Please consult our extensive SSH documentation pages for details.

How can I run my job on the cluster?

Please do not run your jobs on the cluster frontends!

Please consult our documentation for details about Batch Processing.

We also provide general job script examples for parallel jobs and GPU jobs; however, we have also prepared more specific job scripts for applications that our users frequently run on the clusters.

Miscellaneous

I would like to cross use data between HPC accounts

Please be careful when dealing with permissions!

see: man nfs4_acl/nfs4_getfacl/nfs4_setfacl/chmod

This might be the case if you own a Tier2 and a Tier3 account or to share data with a co-worker (group member).

In the following example USER1 will grand access for USER2:

You can add an entire group with
nfs4_setfacl -a A:g:GROUP@rrze.uni-erlangen.de:RX FOLDER

You may have to modify umask to ensure that new files are created with proper permissions.

USER1@frontend: ls -alh /home/woody/$GROUP/$USER/testing
-rwxr-xr-x 1 USER1 GROUP 115 Jan 11 14:24 testing

How to acknowledge resource usage

In general, use the following formulation in publications for acknowledging the resources and the support of NHR@FAU:

for the FAU Tier3 resources: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). The hardware is funded by the German Research Foundation (DFG).”
for the NHR@FAU resources/projects: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) under the NHR project <ID of your NHR@FAU project>. NHR funding is provided by federal and Bavarian state authorities. NHR@FAU hardware is partially funded by the German Research Foundation (DFG) – 440719683.”

(Also do not forget to send a copy of your papers to nhr-redaktion@lists.fau.de!)

I cannot find my HPC account in the HPC portal

I received an invite mail from the HPC-Portal but there is no account data visible

Please check the email address that is transmitted via SSO.

Login to the portal and click on your SSO name in the upper right corner. Go to “Profile” and check the email address that is visible on the left side below “Personal data”.

We can only match invitations that have been sent to the email address shown in your profile. If you see an empty account, please ask for the invitation to be resend to the correct email address.

Password

SSH is asking for a password

What is the password of my HPC account?

Python/Conda

How to fix conda error NoWriteEnvsDirError

You are missing some configuration of conda.

Please set conda config --add envs_dirs $WORK/software/privat/conda/envss and check with conda info that your configuration was successful.

How to fix conda error NoWritePkgsDirError

You are missing some configuration of conda.

Please set conda config --add pkgs_dirs $WORK/software/privat/conda/pkgs and check with conda info that your configuration was successful.

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

Why does my program give a http/https timeout?

Some software may expect instead

export HTTP_PROXY=http://proxy:80
export HTTPS_PROXY=http://proxy:80

Why is my code not running on the GPU in TinyGPU?

Please have look at our documentation about “Working with NVIDIA GPUs“.

Why is my pytorch/tensorflow using CPU only?

The most common mistake is to build the conda/python-venv environment on a fontend and not on a cluster node with GPU.

You can start an interactive job on Alex with salloc --gres=gpu:a40:1 --partition=a40 --time=02:00:00
and on TinyGPU with salloc.tinygpu --gres=gpu:1 --time=02:00:00

SSH

Debugging SSH problems

The get more information on SSH problems, add the “-v” option to SSH. This will give moderate debug information, e.g. show which SSH keys are tried.

Here is a sample output

max@notebook:~$ ssh -v cshpc
OpenSSH_7.6p1 Ubuntu-4ubuntu0.7, OpenSSL 1.0.2n 7 Dec 2017
debug1: Reading configuration data /home/max/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: Connecting to cshpc.rrze.uni-erlangen.de [131.188.3.39] port 22.
debug1: Connection established.
debug1: identity file /home/max/.ssh/id_rsa type 0
debug1: key_load_public: No such file or directory
debug1: identity file /home/max/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: Local version string SSH-2.0-OpenSSH_7.6p1 Ubuntu-4ubuntu0.7
debug1: Remote protocol version 2.0, remote software version OpenSSH_8.2p1 Ubuntu-4ubuntu0.5
debug1: match: OpenSSH_8.2p1 Ubuntu-4ubuntu0.5 pat OpenSSH* compat 0x04000000
debug1: Authenticating to cshpc.rrze.uni-erlangen.de:22 as 'unrz143'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ecdsa-sha2-nistp256 SHA256:wFaDywle3yJvygQ4ZAPDsi/iSBTaF6Uoo0i0z727aJU
debug1: Host 'cshpc.rrze.uni-erlangen.de' is known and matches the ECDSA host key.
debug1: Found key in /home/max/.ssh/known_hosts:215
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<ssh-ed25519,sk-ssh-ed25519@openssh.com,ssh-rsa,rsa-sha2-256,rsa-sha2-512,ssh-dss,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,sk-ecdsa-sha2-nistp256@openssh.com>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,password
debug1: Next authentication method: publickey
debug1: Offering public key: RSA SHA256:mWO4eYar1/JYn8MDB0DPer+ibB/QatmhxvvngfaoMgQ /home/max/.ssh/id_rsa
debug1: Server accepts key: pkalg rsa-sha2-512 blen 277
debug1: Authentication succeeded (publickey).
Authenticated to cshpc.rrze.uni-erlangen.de ([131.188.3.39]:22).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: pledge: network
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
debug1: Sending environment.
debug1: Sending env LC_ALL = en_US.UTF-8
debug1: Sending env LANG = en_US.UTF-8

To check the fingerprint of your SSH key, use

max@notebook:~$ ssh-keygen -l -f ~/.ssh/id_rsa
2048 SHA256:mWO4eYar1/JYn8MDB0DPer+ibB/QatmhxvvngfaoMgQ max@notebook (RSA)

This fingerprint must also match the data shown in the HPC portal (if your SSH keys are managed by the HPC portal).

In the debug output I find the following

This message is miss leading. “sign_and_send_pubkey: signing failed … agent refused operation” typically means that you entered a wrong passphrase for the SSH key.

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

I managed to log in to cshpc (with an SSH key) but get asked for a password / permission denied when continuing to a cluster frontend

There are a couple of solutions to mitigate that:

Use the “jump host”/”proxy jump” feature of SSH and directly connect to the cluster frontends through the dialog server cshpc. To do this, either use the command line option “-j” of recent SSH versions or use an ~/.ssh/config file on your local computer. See https://hpc.fau.de/systems-services/documentation-instructions/ssh-secure-shell-access-to-hpc-systems/#ssh_config_hpc_portal for templates.
Create an additional SSH key pair on cshpc and add the corresponding SSH public key to the HPC portal (if your account is already managed trough the HPC portal) – or add it to ~/.ssh/authorized_keys (which will only be a temporary solution until all HPC accounts are migrated to the HPC portal)
Use an SSH agent on your local computer and allow it to forward its connection to our dialog server cshpc.

All there ways make sure that cshpc has a SSH private key available when connecting to the cluster frontends.

My HPC account just has been created but I cannot login or Slurm rejects my jobs

My just updated SSH key (from the HPC portal) or password (from the IdM passwort) is not accepted

The same is true for the propagation of HPC password for accounts created through the FAU IdM portal using paper applications.

SSH is asking for a password

What is the password of my HPC account?

Slurm

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

My HPC account just has been created but I cannot login or Slurm rejects my jobs

Slurm options get ignored when given as sbatch command line arguments

I give some Slurm options as command line arguments to sbatch, but they are ignored!?

The syntax of sbatch is: sbatch [OPTIONS(0)...] [ : [OPTIONS(N)...]] script(0) [args(0)...]

Thus, options for sbatch have to be given before the batch script. Arguments given after the batch script are used as arguments for the batch script and not for sbatch.

Why is my pytorch/tensorflow using CPU only?

The most common mistake is to build the conda/python-venv environment on a fontend and not on a cluster node with GPU.

You can start an interactive job on Alex with salloc --gres=gpu:a40:1 --partition=a40 --time=02:00:00
and on TinyGPU with salloc.tinygpu --gres=gpu:1 --time=02:00:00

Software environment

Error “module: command not found”

If the module command cannot be found that usually means that you did not invoke the bash shell with the option “-l” (lower case L).

Thus, job scripts, etc. should always start with

#!/bin/bash -l

How can I run my job on the cluster?

Please do not run your jobs on the cluster frontends!

Please consult our documentation for details about Batch Processing.

We also provide general job script examples for parallel jobs and GPU jobs; however, we have also prepared more specific job scripts for applications that our users frequently run on the clusters.

How to fix conda error NoWriteEnvsDirError

You are missing some configuration of conda.

Please set conda config --add envs_dirs $WORK/software/privat/conda/envss and check with conda info that your configuration was successful.

How to fix conda error NoWritePkgsDirError

You are missing some configuration of conda.

Please set conda config --add pkgs_dirs $WORK/software/privat/conda/pkgs and check with conda info that your configuration was successful.

The software I need is not installed. What can I do now?

For information on how to use the modules system, please have a look into the respective section in our documentation about “Software environment“.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack“ as enhanced HPC package manager.

Why does my program give a http/https timeout?

Some software may expect instead

export HTTP_PROXY=http://proxy:80
export HTTPS_PROXY=http://proxy:80

Why is my code not running on the GPU in TinyGPU?

Please have look at our documentation about “Working with NVIDIA GPUs“.

Why is my code not using the GPU on Alex?

CUDA is not installed as part of the OS – you have to load a cuda module for your binaries to find libcublas, libcudnn, etc.

Please also have a look at our documentation “Working with NVIDIA GPUs“.

TinyGPU Cluster

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

How can I leverage node-local storage on TinyGPU to increase job performance?

Each node has at least 880 GB of local SSD capacity for temporary files under $TMPDIR.

The directory $TMPDIR will be deleted automatically as soon as the user has no jobs running on the node any more.

Data to be kept can be copied to a cluster-wide volume at the end of the job.

Please also read our documentation on “File Systems“.

How can I request an interactive job on TinyGPU?

Interactive Slurm Shell (RTX2080Ti, RTX3080, V100 and A100 nodes only)

To generate an interactive Slurm shell on one of the compute nodes, the following command has to be issued on the woody frontend:
salloc.tinygpu --gres=gpu:1 --time=00:30:00

This and more information can be found in our documentation about TinyGPU.

How can I run my job on the cluster?

Please do not run your jobs on the cluster frontends!

Please consult our documentation for details about Batch Processing.

We also provide general job script examples for parallel jobs and GPU jobs; however, we have also prepared more specific job scripts for applications that our users frequently run on the clusters.

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

The software I need is not installed. What can I do now?

For information on how to use the modules system, please have a look into the respective section in our documentation about “Software environment“.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack“ as enhanced HPC package manager.

Why is my code not running on the GPU in TinyGPU?

Please have look at our documentation about “Working with NVIDIA GPUs“.

Why is my pytorch/tensorflow using CPU only?

The most common mistake is to build the conda/python-venv environment on a fontend and not on a cluster node with GPU.

You can start an interactive job on Alex with salloc --gres=gpu:a40:1 --partition=a40 --time=02:00:00
and on TinyGPU with salloc.tinygpu --gres=gpu:1 --time=02:00:00

Woody Cluster

How can I request an interactive job on Woody-NG?

Interactive jobs can be requested by using salloc and specifying the respective options on the command line.

The following will give you an interactive shell on one node with one core dedicated to you for one hour:
salloc -n 1 --time=01:00:00

Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

This and more information can be found in our documentation about Woody-NG.

How can I run my job on the cluster?

Please do not run your jobs on the cluster frontends!

Please consult our documentation for details about Batch Processing.

We also provide general job script examples for parallel jobs and GPU jobs; however, we have also prepared more specific job scripts for applications that our users frequently run on the clusters.

The software I need is not installed. What can I do now?

For information on how to use the modules system, please have a look into the respective section in our documentation about “Software environment“.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack“ as enhanced HPC package manager.

Software

Alex Cluster

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

The software I need is not installed. What can I do now?

For information on how to use the modules system, please have a look into the respective section in our documentation about “Software environment“.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack“ as enhanced HPC package manager.

Why is my code not using the GPU on Alex?

CUDA is not installed as part of the OS – you have to load a cuda module for your binaries to find libcublas, libcudnn, etc.

Please also have a look at our documentation “Working with NVIDIA GPUs“.

Why is my pytorch/tensorflow using CPU only?

The most common mistake is to build the conda/python-venv environment on a fontend and not on a cluster node with GPU.

You can start an interactive job on Alex with salloc --gres=gpu:a40:1 --partition=a40 --time=02:00:00
and on TinyGPU with salloc.tinygpu --gres=gpu:1 --time=02:00:00

Basic HPC Knowledge

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

Why does my program give a http/https timeout?

Some software may expect instead

export HTTP_PROXY=http://proxy:80
export HTTPS_PROXY=http://proxy:80

Why is my pytorch/tensorflow using CPU only?

The most common mistake is to build the conda/python-venv environment on a fontend and not on a cluster node with GPU.

You can start an interactive job on Alex with salloc --gres=gpu:a40:1 --partition=a40 --time=02:00:00
and on TinyGPU with salloc.tinygpu --gres=gpu:1 --time=02:00:00

Batch System

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

CUDA

Why is my code not running on the GPU in TinyGPU?

Please have look at our documentation about “Working with NVIDIA GPUs“.

Why is my code not using the GPU on Alex?

CUDA is not installed as part of the OS – you have to load a cuda module for your binaries to find libcublas, libcudnn, etc.

Please also have a look at our documentation “Working with NVIDIA GPUs“.

Why is my pytorch/tensorflow using CPU only?

The most common mistake is to build the conda/python-venv environment on a fontend and not on a cluster node with GPU.

You can start an interactive job on Alex with salloc --gres=gpu:a40:1 --partition=a40 --time=02:00:00
and on TinyGPU with salloc.tinygpu --gres=gpu:1 --time=02:00:00

Cluster Access

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

The software I need is not installed. What can I do now?

For information on how to use the modules system, please have a look into the respective section in our documentation about “Software environment“.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack“ as enhanced HPC package manager.

Continuous X

What is Continuous Benchmarking (CB)?

CB can be seen as a variant of CT, where not only functionality but also performance is tested in order to avoid regressions, i.e., unwanted performance degradation due to code changes.

Please also see our documentation on “Continuous Integration / Gitlab Cx“.

What is Continuous Deployment (CD)?

CD is the automatic deployment of the software coming out of the other Cx processes. This can be the installation on a particular system, rolling out a revision within a whole organization, pushing installation packages to public repositories, etc.

Please also see our documentation on “Continuous Integration / Gitlab Cx“.

What is Continuous Integration (CI)?

Continuous Integration is the practice of automatically integrating code changes into a software project. It relies on a code repository that supports automated building and testing. Often, CI also involves setting up a build system from scratch, including all dependencies.

Please also see our documentation on “Continuous Integration / Gitlab Cx“.

What is Continuous Testing (CT)?

It is the practice of executing automated tests as an integral part of the software development process. It tries to make sure that no functionality is lost and no errors are introduced during development.

Please also see our documentation on “Continuous Integration / Gitlab Cx“.

FileSystems/Data Storage

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

Fritz Cluster

The software I need is not installed. What can I do now?

For information on how to use the modules system, please have a look into the respective section in our documentation about “Software environment“.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack“ as enhanced HPC package manager.

GPU usage

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

Why is my code not running on the GPU in TinyGPU?

Please have look at our documentation about “Working with NVIDIA GPUs“.

Why is my code not using the GPU on Alex?

CUDA is not installed as part of the OS – you have to load a cuda module for your binaries to find libcublas, libcudnn, etc.

Please also have a look at our documentation “Working with NVIDIA GPUs“.

Why is my pytorch/tensorflow using CPU only?

The most common mistake is to build the conda/python-venv environment on a fontend and not on a cluster node with GPU.

You can start an interactive job on Alex with salloc --gres=gpu:a40:1 --partition=a40 --time=02:00:00
and on TinyGPU with salloc.tinygpu --gres=gpu:1 --time=02:00:00

Python/Conda

How to fix conda error NoWriteEnvsDirError

You are missing some configuration of conda.

Please set conda config --add envs_dirs $WORK/software/privat/conda/envss and check with conda info that your configuration was successful.

How to fix conda error NoWritePkgsDirError

You are missing some configuration of conda.

Please set conda config --add pkgs_dirs $WORK/software/privat/conda/pkgs and check with conda info that your configuration was successful.

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

Why does my program give a http/https timeout?

Some software may expect instead

export HTTP_PROXY=http://proxy:80
export HTTPS_PROXY=http://proxy:80

Why is my code not running on the GPU in TinyGPU?

Please have look at our documentation about “Working with NVIDIA GPUs“.

Why is my pytorch/tensorflow using CPU only?

The most common mistake is to build the conda/python-venv environment on a fontend and not on a cluster node with GPU.

You can start an interactive job on Alex with salloc --gres=gpu:a40:1 --partition=a40 --time=02:00:00
and on TinyGPU with salloc.tinygpu --gres=gpu:1 --time=02:00:00

SSH

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

Slurm

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

Why is my pytorch/tensorflow using CPU only?

The most common mistake is to build the conda/python-venv environment on a fontend and not on a cluster node with GPU.

You can start an interactive job on Alex with salloc --gres=gpu:a40:1 --partition=a40 --time=02:00:00
and on TinyGPU with salloc.tinygpu --gres=gpu:1 --time=02:00:00

Software environment

How to fix conda error NoWriteEnvsDirError

You are missing some configuration of conda.

Please set conda config --add envs_dirs $WORK/software/privat/conda/envss and check with conda info that your configuration was successful.

How to fix conda error NoWritePkgsDirError

You are missing some configuration of conda.

Please set conda config --add pkgs_dirs $WORK/software/privat/conda/pkgs and check with conda info that your configuration was successful.

The software I need is not installed. What can I do now?

For information on how to use the modules system, please have a look into the respective section in our documentation about “Software environment“.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack“ as enhanced HPC package manager.

Why does my program give a http/https timeout?

Some software may expect instead

export HTTP_PROXY=http://proxy:80
export HTTPS_PROXY=http://proxy:80

Why is my code not running on the GPU in TinyGPU?

Please have look at our documentation about “Working with NVIDIA GPUs“.

Why is my code not using the GPU on Alex?

CUDA is not installed as part of the OS – you have to load a cuda module for your binaries to find libcublas, libcudnn, etc.

Please also have a look at our documentation “Working with NVIDIA GPUs“.

TinyGPU Cluster

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

The software I need is not installed. What can I do now?

For information on how to use the modules system, please have a look into the respective section in our documentation about “Software environment“.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack“ as enhanced HPC package manager.

Why is my code not running on the GPU in TinyGPU?

Please have look at our documentation about “Working with NVIDIA GPUs“.

Why is my pytorch/tensorflow using CPU only?

The most common mistake is to build the conda/python-venv environment on a fontend and not on a cluster node with GPU.

You can start an interactive job on Alex with salloc --gres=gpu:a40:1 --partition=a40 --time=02:00:00
and on TinyGPU with salloc.tinygpu --gres=gpu:1 --time=02:00:00

Woody Cluster

The software I need is not installed. What can I do now?

For information on how to use the modules system, please have a look into the respective section in our documentation about “Software environment“.

Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack“ as enhanced HPC package manager.

Hardware

Alex Cluster

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

Basic HPC Knowledge

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

Is it true that Arm processors are now competitive in HPC?

“Arm CPU” just means that it uses an instruction set architecture (ISA) licensed from Arm, but the hardware implementation can vary a lot. There is a plethora of Arm-based designs, some from Arm and many from other vendors. Many target the low-power/embedded market, but there are some which have entered the HPC area. Prominent examples are the Fujitsu A64FX and the Marvell ThunderX2. A TX2 system is available in the NHR test and benchmark cluster.

What is a parallel file system?

For information on how to use a parallel file system on our clusters, please read our documentation on “Parallel file systems $FASTTMP“.

What is a vector computer?

A vector computer has an ISA and CPU architecture that enable efficient operations on array data. This goes under the name of Single Instruction Multiple Data (SIMD). SIMD features have proliferated in commodity CPUs as well, but a true vector CPU has features that make it more efficient, such as large vector lengths (e.g., 256 elements) and a high memory bandwidth (e.g., 1.5 Tbyte/s). Currently, only NEC offers a true vector processor, the SX-Aurora Tsubasa. A node with two Tsubasa cards is available in the NHR test and benchmark cluster.

Why should I care about file systems?

Batch System

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

Cluster Access

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

FileSystems/Data Storage

How can I leverage node-local storage on TinyGPU to increase job performance?

Each node has at least 880 GB of local SSD capacity for temporary files under $TMPDIR.

The directory $TMPDIR will be deleted automatically as soon as the user has no jobs running on the node any more.

Data to be kept can be copied to a cluster-wide volume at the end of the job.

Please also read our documentation on “File Systems“.

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

What is a parallel file system?

For information on how to use a parallel file system on our clusters, please read our documentation on “Parallel file systems $FASTTMP“.

What is file system metadata?

Metadata comprises all the bookkeeping information in a file system: file sizes, permissions, modification and access times, etc. A workload that, e.g., opens and closes files in rapid succession leads to frequent metadata accesses, putting a lot of strain on any file server infrastructure. This is why a small number of users with “toxic” workload can slow down file operations to a crawl for everyone. Note also that especially parallel file systems are ill-suited for metadata-heavy operations.

Where can I store my data?

Your home directory is accessible via $HOME. Each user gets a standard quota of 50 Gigabytes and quota extensions are not possible.

Additional storage is accessible via $HPCVAULT. Here, the default quota for each user is 500 Gigabytes.

The recommended work directory is accessible via $WORK. The standard quota for each user is 500 Gigabytes.

All three directories ($HOME, $HPCVAULT and $WORK) are available throughout our HPC systems.

We recommend you use the aforementioned variables in your jobscripts and not rely on the specific paths as this may change over time, i.e. when directories are relocated to a different NFS server.

Please also have a look into our documentation on “File Systems“.

Why should I care about file systems?

Why the need for several file systems?

For further information, please consult our documentation on “File systems“.

GPU usage

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

Miscellaneous

Is it true that Arm processors are now competitive in HPC?

What is a vector computer?

Python/Conda

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

SSH

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

Slurm

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

Test Cluster

Is it true that Arm processors are now competitive in HPC?

What is a vector computer?

TinyGPU Cluster

How can I attach to a running Slurm job

This is an alternative to SSH-ing into your node.

How can I leverage node-local storage on TinyGPU to increase job performance?

Each node has at least 880 GB of local SSD capacity for temporary files under $TMPDIR.

The directory $TMPDIR will be deleted automatically as soon as the user has no jobs running on the node any more.

Data to be kept can be copied to a cluster-wide volume at the end of the job.

Please also read our documentation on “File Systems“.

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

General information

Acknowledgement

How to acknowledge resource usage

In general, use the following formulation in publications for acknowledging the resources and the support of NHR@FAU:

for the FAU Tier3 resources: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). The hardware is funded by the German Research Foundation (DFG).”
for the NHR@FAU resources/projects: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) under the NHR project <ID of your NHR@FAU project>. NHR funding is provided by federal and Bavarian state authorities. NHR@FAU hardware is partially funded by the German Research Foundation (DFG) – 440719683.”

(Also do not forget to send a copy of your papers to nhr-redaktion@lists.fau.de!)

Alex Cluster

How can I access the new clusters Alex and Fritz?

If you do not have an HPC account, please follow our instructions on “Getting started with HPC“.

External scientists have to submit a NHR proposal to get access.

Basic HPC Knowledge

Is it true that Arm processors are now competitive in HPC?

What is a vector computer?

What is SMT (also known as hyperthreading)?

What is thread or process affinity?

Why does my program give a http/https timeout?

Some software may expect instead

export HTTP_PROXY=http://proxy:80
export HTTPS_PROXY=http://proxy:80

Cluster Access

How can I access the cluster frontends?

Almost all HPC systems at NHR@FAU use private IP addresses that can only be accessed directly from within the FAU.There are three options for logging in to the clusters from outside the university:

Use a VPN (Virtual Private Network) connection.
Use IPv6. The cluster frontends have world-visible IPv6 addresses.
Use our “dialog server” cshpc.rrze.fau.de. The dialog server is the only HPC machine with a public IPv4 address. cshpc is a Linux system that is open to all HPC accounts. From this machine, you can log into any NHR@FAU system. A more complete description can be found on our documentation pages.

Whichever option you choose, you need to use SSH. Please consult our extensive SSH documentation pages for details.

How can I access the new clusters Alex and Fritz?

If you do not have an HPC account, please follow our instructions on “Getting started with HPC“.

External scientists have to submit a NHR proposal to get access.

How to acknowledge resource usage

In general, use the following formulation in publications for acknowledging the resources and the support of NHR@FAU:

for the FAU Tier3 resources: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). The hardware is funded by the German Research Foundation (DFG).”
for the NHR@FAU resources/projects: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) under the NHR project <ID of your NHR@FAU project>. NHR funding is provided by federal and Bavarian state authorities. NHR@FAU hardware is partially funded by the German Research Foundation (DFG) – 440719683.”

(Also do not forget to send a copy of your papers to nhr-redaktion@lists.fau.de!)

I cannot find my HPC account in the HPC portal

I received an invite mail from the HPC-Portal but there is no account data visible

Please check the email address that is transmitted via SSO.

Login to the portal and click on your SSO name in the upper right corner. Go to “Profile” and check the email address that is visible on the left side below “Personal data”.

We can only match invitations that have been sent to the email address shown in your profile. If you see an empty account, please ask for the invitation to be resend to the correct email address.

ClusterCockpit

How can I access a link to monitoring.nhr.fau.de?

For HPC portal users (i.e., who have accounts without a password), the job-specific monitoring of ClusterCockpit is only accessible via the HPC portal.

Contact

How can I contact the HPC team?

An informal and low-threshold way to talk to members of the HPC team is our regular HPC Café. The HPC Café takes place every second Tuesday of the month at 4:00 p.m. in seminar room 2.049 at RRZE, Martensstr.1, 91058 Erlangen. Due to the Covid-19 pandemic, the HPC Café was replaced by an online consultation hour since early 2020. Details are published on the HPC Café website.

Note: Currently we mostly work from home and thus cannot be reached via our office phone numbers. We can arrange virtual appointments by Zoom, MS Teams, or BigBlueButton. You may also call RRZE’s HelpDesk (+49-9131-85-29955) and leave a message for us—but you probably get a faster response by sending us an e-mail (hpc-support@fau.de).

FileSystems/Data Storage

Where can I store my data?

Your home directory is accessible via $HOME. Each user gets a standard quota of 50 Gigabytes and quota extensions are not possible.

Additional storage is accessible via $HPCVAULT. Here, the default quota for each user is 500 Gigabytes.

The recommended work directory is accessible via $WORK. The standard quota for each user is 500 Gigabytes.

All three directories ($HOME, $HPCVAULT and $WORK) are available throughout our HPC systems.

We recommend you use the aforementioned variables in your jobscripts and not rely on the specific paths as this may change over time, i.e. when directories are relocated to a different NFS server.

Please also have a look into our documentation on “File Systems“.

Fritz Cluster

How can I access the new clusters Alex and Fritz?

If you do not have an HPC account, please follow our instructions on “Getting started with HPC“.

External scientists have to submit a NHR proposal to get access.

HPC-Portal

How can I access a link to monitoring.nhr.fau.de?

For HPC portal users (i.e., who have accounts without a password), the job-specific monitoring of ClusterCockpit is only accessible via the HPC portal.

I cannot find my HPC account in the HPC portal

I have a Tier3-project in the HPC portal. Which of the project categories is the correct one for the new account I want to add?

There are three default Tier3-project categories in the HPC portal:

###100: Tier3 Grundversorgung¹ <name of institute/department> (Prof. <name>)—that’s for staff, PhD students with a contract at FAU (“contract” does not necessarily mean “employment”), employed student assistants, etc.
###101: Studentische Abschlußarbeiten² Tier3 Grundversorgung¹ <name of institute/department> (Prof. <name>)—that’s for students doing their bachelor/master thesis or some sort of study-related project work (i.e., coursework) but usually not for tutorials of a lecture (that would be a separate category⁴); doctoral theses do not fall into this category (that would be basic supply ###100)
###102: Projektpartner³ Tier3 Grundversorgung¹ <name of institute/department> (Prof. <name>)—that’s for people not belonging to FAU, i.e., all people who should have access to HPC resources as part of your Tier3 HPC basic provision but who are neither students nor employees/scholarship holders/etc. at FAU; guest researchers or former employees, for example, fall into this category

[1] Basic supply [2] Student theses [3] Project partners
[4] In some cases, there is ###103 for teaching activities but usually, there will be a dedicated project with a separate name space.

I received an invite mail from the HPC-Portal but there is no account data visible

Please check the email address that is transmitted via SSO.

Login to the portal and click on your SSO name in the upper right corner. Go to “Profile” and check the email address that is visible on the left side below “Personal data”.

We can only match invitations that have been sent to the email address shown in your profile. If you see an empty account, please ask for the invitation to be resend to the correct email address.

Login

How can I access the cluster frontends?

Almost all HPC systems at NHR@FAU use private IP addresses that can only be accessed directly from within the FAU.There are three options for logging in to the clusters from outside the university:

Use a VPN (Virtual Private Network) connection.
Use IPv6. The cluster frontends have world-visible IPv6 addresses.
Use our “dialog server” cshpc.rrze.fau.de. The dialog server is the only HPC machine with a public IPv4 address. cshpc is a Linux system that is open to all HPC accounts. From this machine, you can log into any NHR@FAU system. A more complete description can be found on our documentation pages.

Whichever option you choose, you need to use SSH. Please consult our extensive SSH documentation pages for details.

How can I change my HPC password?

If you applied for the HPC account using the paper forms: Please log in to www.idm.fau.de with your IdM account and change the HPC password under Services.

Please note that it generally takes a few hours until password changes are known on the HPC systems, the change will not happen at the same time on all clusters.

If you got your HPC account through the new HPC portal, there is no password for the HPC account at all. Authorization at the HPC portal is done using SSO (provided by your home institute) and login to the HPC systems is by SSH keys (which have to be uploaded through the HPC portal). Information on how to upload SSH keys to the HPC portal can be found in our documentation on the HPC portal.

Miscellaneous

How to acknowledge resource usage

In general, use the following formulation in publications for acknowledging the resources and the support of NHR@FAU:

for the FAU Tier3 resources: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). The hardware is funded by the German Research Foundation (DFG).”
for the NHR@FAU resources/projects: “The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) under the NHR project <ID of your NHR@FAU project>. NHR funding is provided by federal and Bavarian state authorities. NHR@FAU hardware is partially funded by the German Research Foundation (DFG) – 440719683.”

(Also do not forget to send a copy of your papers to nhr-redaktion@lists.fau.de!)

I cannot find my HPC account in the HPC portal

I heard that RISC-V is “the new thing”.

RISC-V is just a modern, open instruction set architecture (ISA) that does not have the licensing issues of Arm. However, the underlying processor architecture will mainly determine the performance of code. So far, competitive RISC-V designs are nowhere to be seen in HPC, but this may change in the future.

I received an invite mail from the HPC-Portal but there is no account data visible

Please check the email address that is transmitted via SSO.

Login to the portal and click on your SSO name in the upper right corner. Go to “Profile” and check the email address that is visible on the left side below “Personal data”.

We can only match invitations that have been sent to the email address shown in your profile. If you see an empty account, please ask for the invitation to be resend to the correct email address.

Is it true that Arm processors are now competitive in HPC?

What about the Apple M1?

It’s positively impressive, in terms of memory bandwidth as well as the architecture of its memory hierarchy. Current models still lack the peak performance needed to be competitive with x86 server CPUs, however.

What is a vector computer?

Password

How can I change my HPC password?

If you applied for the HPC account using the paper forms: Please log in to www.idm.fau.de with your IdM account and change the HPC password under Services.

Please note that it generally takes a few hours until password changes are known on the HPC systems, the change will not happen at the same time on all clusters.

Python/Conda

Why does my program give a http/https timeout?

Some software may expect instead

export HTTP_PROXY=http://proxy:80
export HTTPS_PROXY=http://proxy:80

Software environment

Why does my program give a http/https timeout?

Some software may expect instead

export HTTP_PROXY=http://proxy:80
export HTTPS_PROXY=http://proxy:80

Test Cluster

Is it true that Arm processors are now competitive in HPC?

What is a vector computer?