Python and Jupyter
JupyterHub was the topic of the HPC Cafe in October 2020.
You can find our new JupyterHub at https://hub.hpc.fau.de/jupyter/. This new instance also has more GPU resources available.
If you have an HPC account without a password, i.e. one managed through the new "HPC portal", use the link provided in the "External Tools" section on the User tab of the HPC portal to access our new JupyterHub instance.
This page addresses some common pitfalls when working with Python and related tools on a shared system like a cluster.
The following topics will be discussed in detail on this page:
- Available python versions
- Installing packages
- Conda environment
- Jupyter notebook security
- Installation and usage of mpi4py under Conda
Available python versions
All Unix systems come with a system-wide Python installation; however, for the cluster it is highly recommended to use one of the Anaconda installations provided as modules:
# reminder
module avail python
module load python/XY
These modules come with a wide range of preinstalled packages.
Installing packages
There are different ways of managing Python packages on the cluster. The following list is not complete; however, it highlights methods which are known to work well with the local software stack.
As a general note, it is recommended to build packages in an interactive job on the target cluster to make sure all hardware can be used properly.
Make sure to load all modules that might be needed by your Python code (e.g. CUDA for GPU support) and set the following proxy variables if external repositories need to be reached:
export http_proxy=http://proxy:80
export https_proxy=http://proxy:80
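As a rough sketch, the preparation could look like this (the salloc options and module names are placeholders and differ per cluster):
# request an interactive job on the target cluster
salloc --nodes=1 --time=01:00:00
# load modules needed by your Python code, e.g. CUDA for GPU support
module load python/XY
module load cuda
# set the proxy variables from above if external repositories are needed,
# then install packages as described in the following sections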
Using pip
Pip is a package manager for Python. It can be used to easily install packages and manage their versions.
By default pip will try to install packages system-wide, which is not possible due to missing permissions.
The behavior can be changed by adding --user to the call:
pip install --user package-name
or, from within Jupyter notebooks:
%pip install --user --proxy http://proxy:80 package-name
By defining the variable PYTHONUSERBASE (best done in your .bashrc/.bash_profile) you can change the installation location from ~/.local to a different path. Doing so prevents your home directory from being cluttered with data that does not need a backup and from hitting the quota:
export PYTHONUSERBASE=$WORK/software/privat
If you intend to share the packages/environments with your co-workers, consider wrapping the Python package inside a module.
For information on the module system see the HPC Cafe from March 2020.
- Set up and define the target folder with PYTHONUSERBASE.
- Install the package as above.
- Your module file needs to add the site-packages folder to PYTHONPATH and, if the package comes with binaries, the bin folder to PATH.
For an example see the module quantumtools on woody.
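As a minimal sketch, such a module file essentially has to perform the equivalent of the following shell commands (the paths and the Python version are placeholders for your actual installation):
# installation location that was used with PYTHONUSERBASE
export PYTHONUSERBASE=$WORK/software/privat
# make the installed packages importable
export PYTHONPATH=$PYTHONUSERBASE/lib/python3.X/site-packages:$PYTHONPATH
# make binaries shipped with the package available (only needed if there are any)
export PATH=$PYTHONUSERBASE/bin:$PATH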
Conda environment
In order to use Conda environments on the HPC cluster some preparation has to be done.
Remember that a Python module needs to be loaded at all times – see module avail python.
Run
conda init bash
If you use a different shell, replace bash by the shell of your choice. Then run
source ~/.bashrc
and again replace .bashrc if you use a different shell.
The process was successful if your prompt now starts with (base).
Create a ~/.profile with the following content:
if [ -n "$BASH_VERSION" ]; then
    # include .bashrc if it exists
    if [ -f "$HOME/.bashrc" ]; then
        . "$HOME/.bashrc"
    fi
fi
For batch jobs it might be necessary to use source activate <myenv> instead of conda activate <myenv>.
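For example, the relevant fragment of a job script could look like this (the module version, environment name, and script name are placeholders):
module load python/3.7-anaconda
source activate myenv    # conda activate myenv may fail in batch jobs
python my_script.py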
Some scientific software comes in the form of a Conda environment (e.g. https://docs.gammapy.org/0.17/install/index.html).
By default such an environment will be installed to ~/.conda. However, its size can be several GB; therefore you should configure Conda to use a different path, which prevents your home directory from hitting the quota. This can be done by following these steps:
conda config # create ~/.condarc
Add the following lines to the file (replace the path if you prefer a different location)
pkgs_dirs:
- ${WORK}/software/privat/conda/pkgs
envs_dirs:
- ${WORK}/software/privat/conda/envs
You can check that this configuration file is properly read by inspecting the output of conda info
For more options see https://conda.io/projects/conda/en/latest/user-guide/configuration/use-condarc.html
Conda environments can also be used for package management (and more).
You can share Conda environments with co-workers by having them add your environment path to their envs_dirs as well.
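For example, a co-worker could add your environments directory to their own ~/.condarc (the path is a placeholder for the actual location of your envs folder):
envs_dirs:
  - /path/to/your/work/software/privat/conda/envs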
Create your own environment with:
conda create --name myenv python=3.9    # specifying the Python version is optional
conda activate myenv
conda install package-name              # or: pip install package-name
Packages will end up inside the Conda environment, therefore no --user option is needed.
Conda environments come with the extra benefit of ease of use: on jupyterhub.rrze.uni-erlangen.de they show up as a kernel option when starting a notebook.
Jupyter notebook security
With their default configuration, Jupyter notebooks are only protected by a random token, which in some circumstances can cause security issues on a multi-user system like cshpc or the cluster frontends. We can change this with a few configuration steps by adding password protection.
First generate a configuration file by executing
jupyter notebook --generate-config
Open a python terminal and generate a password:
from notebook.auth import passwd; passwd()
Add the password hash to your notebook config file:
# The string should be of the form type:salt:hashed-password.
c.NotebookApp.password = u''
c.NotebookApp.password_required = True
From now on your notebook will be password protected. This comes with the benefit that you can use bash functions for more convenient access (see below).
Quick reminder how to use the remote notebook:
# start notebook on a frontend (e.g. woody)
jupyter notebook --no-browser --port=XXXX
On your client, use:
ssh -f user_name@remote_server -L YYYY:localhost:XXXX -N
Open the notebook in your local browser at https://localhost:YYYY
With XXXX and YYYY being 4 digit numbers.
Don’t forget to stop the notebook once you are done. Otherwise you will block resources that could be used by others!
Some useful functions/aliases for lazy people 😉
alias remote_notebook_stop='ssh username@remote_server_ip "pkill -u username jupyter"'
Be aware this will kill all jupyter processes that you own!
start_jp_woody(){
    nohup ssh -J username@cshpc.rrze.fau.de -L $1:localhost:$1 username@woody.nhr.fau.de \
        ". /etc/bash.bashrc.local; module load python/3.7-anaconda; jupyter notebook --port=$1 --no-browser"
    echo ""
    echo " the notebook can be started in your browser at: https://localhost:$1/ "
    echo ""
}
start_jp_meggie(){
    nohup ssh -J username@cshpc.rrze.fau.de -L $1:localhost:$1 username@meggie.rrze.fau.de \
        ". /etc/profile; module load python/3.7-anaconda; jupyter notebook --port=$1 --no-browser"
    echo ""
    echo " the notebook can be started in your browser at: https://localhost:$1/ "
    echo ""
}
If you are using a C shell, remove . /etc/bash.bashrc.local and . /etc/profile from the functions.
Installation and usage of mpi4py under Conda
Installing mpi4py via pip will install a generic MPI that will not work on our clusters. We recommend separately installing mpi4py for each cluster through the following steps:
- If Conda is not already configured and initialized, follow the steps documented under Conda environment.
- For more details regarding the installation refer to the official documentation of mpi4py.
Note: Running MPI parallel Python scripts is only supported on the compute nodes and not on frontend nodes.
Installation
Installation must be performed on the cluster frontend node:
- Load the Anaconda module.
- Load the MPI module.
- Install mpi4py and specify the path to the MPI compiler wrapper:
MPICC=$(which mpicc) pip install --no-cache-dir mpi4py
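Put together, the installation could look like this on a frontend (module names/versions and the environment name are placeholders and depend on the cluster):
module load python/3.7-anaconda
module load openmpi        # or the MPI flavor recommended for the cluster
conda activate myenv
MPICC=$(which mpicc) pip install --no-cache-dir mpi4py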
Testing the installation must be performed inside an interactive job:
- Load the Anaconda and MPI module versions mpi4py was built with.
- Activate the environment.
- Run the MPI parallel Python script:
srun python -m mpi4py.bench helloworld
This should print for each process a line in the form of:
Hello, World! I am process <rank> of <size> on <hostname>
The number of processes to start is configured through the respective options of salloc.
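A possible interactive test session could look like this (salloc options, module names, and environment name are placeholders):
# request an interactive job with 4 MPI processes
salloc --ntasks=4 --time=00:30:00
# load the modules mpi4py was built with and activate the environment
module load python/3.7-anaconda openmpi
conda activate myenv
# run the test
srun python -m mpi4py.bench helloworld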
Usage
MPI parallel Python scripts with mpi4py only work inside a job on a compute node.
In an interactive job or inside a job script run the following steps:
- Load the Anaconda and MPI module versions mpi4py was built with.
- Initialize/activate the environment.
- Run the MPI parallel Python script:
srun python <script>
The number of processes to start is configured through the respective options in the job script or of salloc.
For how to request an interactive job via salloc and how to write a job script, see batch processing.
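As a rough sketch, a corresponding job script could look like this (module names, environment name, resource requests, and the script name are placeholders):
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:30:00
# load the modules mpi4py was built with and activate the environment
module load python/3.7-anaconda openmpi
source activate myenv
# run the MPI parallel Python script
srun python my_script.py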