Meggie parallel cluster
The RRZE’s Meggie cluster (manufacturer: Megware) is a high-performance compute resource with a high-speed interconnect. It is intended for distributed-memory (MPI) or hybrid parallel programs with medium to high communication requirements.
- 728 compute nodes, each with two Intel Xeon E5-2630v4 “Broadwell” chips (10 cores per chip) running at 2.2 GHz with 25 MB Shared Cache per chip and 64 GB of RAM.
- 2 front end nodes with the same CPUs as the compute nodes but 128 GB of RAM.
- Lustre-based parallel filesystem with a capacity of almost 1 PB and an aggregated parallel I/O bandwidth of > 9000 MB/s.
- Intel OmniPath interconnect with up to 100 GBit/s bandwidth per link and direction.
- Measured LINPACK performance of ~481 TFlop/s.
The name “meggie” is a play on the name of the manufacturer.
Meggie is designed for running parallel programs that use significantly more than one node. Jobs that use less than one full node are not supported by RRZE and may be killed without notice.
This website shows information regarding the following topics:
- Access, User Environment, File Systems
- Further Information
Note that access to Meggie is still restricted: If you want access to it, you will need to contact hpc@rrze and provide a short (!) description of what you want to do there.
Users can connect to
and will be randomly routed to one of the two front ends. All systems in the cluster, including the front ends, have private IPv4 addresses in the 10.28.24.0/21 range and IPv6 addresses in the 2001:638:a000:3924::/64 range. They can normally only be accessed directly from within the FAU networks. There is one exception: if your internet connection supports IPv6, you can ssh directly to the front ends (but not to the compute nodes). Otherwise, if you need access from outside of FAU, you usually have to connect to a dialog server such as cshpc.rrze.fau.de first and then ssh to meggie from there.
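For access from outside FAU without IPv6, the two hops can be combined with the ProxyJump option of OpenSSH. The following is a minimal sketch, assuming an OpenSSH client of version 7.3 or newer. The account name your_hpc_account is a placeholder, and it is assumed that the short name meggie resolves on the dialog server, as in the two-step login described above.

```shell
# Write an OpenSSH client configuration snippet to a temporary file.
# "your_hpc_account" is a placeholder. "meggie" is assumed to resolve
# on the dialog server (with ProxyJump, the jump host resolves the name).
cfg="$(mktemp)"
cat > "$cfg" <<'EOF'
Host cshpc
    HostName cshpc.rrze.fau.de
    User your_hpc_account

Host meggie
    HostName meggie
    User your_hpc_account
    ProxyJump cshpc
EOF

# Usage (not executed here): log in to a front end in a single step.
#   ssh -F "$cfg" meggie
```

The same two Host entries can be appended to ~/.ssh/config, after which a plain "ssh meggie" should be sufficient.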
While it is possible to ssh directly to a compute node, a user is only allowed to do this while they have a batch job running there. When all batch jobs of a user on a node have ended, all of their processes, including any open shells, will be killed automatically.
The login and compute nodes run CentOS (essentially Red Hat Enterprise Linux without the support). As on most other RRZE HPC systems, a modules environment is provided to facilitate access to software packages.
Type “module avail” to get a list of available packages.
The shell for all users on Meggie is always bash. This is different from our other clusters and the rest of RRZE, where the shell used to be tcsh unless you had requested it to be changed.
The following table summarizes the available file systems and their features. It is only an excerpt from the description of the HPC file system.
|Mount point||Access via||Purpose||Technology, size||Backup||Data lifetime||Quota|
| || ||Storage of source, input and important results||NFS on central servers, small||YES + Snapshots||Account lifetime||YES (restrictive)|
| || ||Medium- to long-term storage||central servers, HSM||YES + Snapshots||Account lifetime||YES|
| ||$WOODYHOME||Short- to medium-term storage or small files||central NFS server||NO||Account lifetime||YES|
| || ||High performance parallel I/O; short-term storage||Lustre-based parallel file system via OmniPath, almost 1 PB||NO||High watermark deletion||NO|
Please note the following differences to our older clusters:
- Unlike previous clusters, the nodes do not have any local hard disk drives.
- /tmp lies in RAM, so it is absolutely NOT possible to store more than a few MB of data there.
NFS file system
When connecting to one of the front end nodes, you’ll find yourself in your regular RRZE $HOME directory (/home/hpc/...). The quotas there are relatively tight, so it will most probably be too small for the inputs/outputs of your jobs. However, it does offer a lot of nice features, like fine-grained snapshots, so use it for “important” stuff, e.g. your job scripts or the source code of the program you’re working on. See the HPC file system page for a more detailed description of the features.
Parallel file system
The cluster’s parallel file system is mounted on all nodes under /lxfs/$GROUP/$USER/ and available via the $FASTTMP environment variable. It supports parallel I/O using the MPI-I/O functions and can be accessed with an aggregate bandwidth of > 9000 MB/s (and even much more if caching effects can be exploited).
The parallel file system is strictly intended as high-performance short-term storage, so a high watermark deletion algorithm is employed: when the fill level of the file system exceeds a certain limit (e.g. 80%), files are deleted, starting with the oldest and largest, until the fill level drops below 60%.
Be aware that the normal tar -x command preserves the modification time of the original file instead of using the time when the archive is unpacked, so unpacked files may become one of the first candidates for deletion. Use tar -mx or touch in combination with find to work around this. Also be aware that the exact time of deletion is unpredictable.
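Both workarounds can be tried out in a scratch directory before relying on them. The following is a minimal sketch in which temporary directories stand in for $FASTTMP and the file name old.dat is hypothetical. It assumes GNU tar, touch, find and ls, as shipped with CentOS.

```shell
# Build a small archive containing a file with an old modification time.
src="$(mktemp -d)"; dst1="$(mktemp -d)"; dst2="$(mktemp -d)"
echo "data" > "$src/old.dat"
touch -d "2001-01-01 00:00:00" "$src/old.dat"   # pretend the file is old
tar -cf "$src/archive.tar" -C "$src" old.dat

# Variant 1: -m makes tar use the time of extraction as modification time.
tar -mxf "$src/archive.tar" -C "$dst1"

# Variant 2: a plain extraction keeps the old time stamp ...
tar -xf "$src/archive.tar" -C "$dst2"
# ... so reset the modification time of all extracted files afterwards.
find "$dst2" -type f -exec touch {} +

# Show the resulting modification times of both copies.
ls -l --time-style=long-iso "$dst1/old.dat" "$dst2/old.dat"
```

Variant 1 is the simpler choice when unpacking. Variant 2 is useful for files that have already been unpacked with the old time stamps.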
Note that parallel filesystems generally are not made for handling large numbers of small files. This is by design: parallel filesystems achieve their speed by writing to multiple servers at the same time. However, they do so in blocks, in our case 1 MB. That means that for a file smaller than 1 MB, only one server will ever be used, so the parallel filesystem can never be faster than a traditional NFS server. On the contrary, due to the larger overhead, it will generally be slower. Parallel filesystems only show their strengths with files that are at least a few megabytes in size, and they excel when very large files are written by many nodes simultaneously (e.g. checkpointing).
For that reason, we have set a limit on the number of files you can store there.
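To stay below the file-count limit, many small files can be bundled into a single archive before they are kept on the parallel file system. The following is a minimal sketch. The directory layout and file names are hypothetical, and a temporary directory stands in for $FASTTMP.

```shell
# A temporary directory stands in for a results directory on $FASTTMP.
work="$(mktemp -d)"
mkdir "$work/results"
for i in $(seq 1 100); do
    echo "sample $i" > "$work/results/part_$i.txt"   # 100 tiny files
done

# Count the files below the directory (each one counts against the limit).
before=$(find "$work" -type f | wc -l)

# Bundle the small files into one archive, then remove the originals.
tar -cf "$work/results.tar" -C "$work" results && rm -r "$work/results"

after=$(find "$work" -type f | wc -l)
echo "files before: $before, after: $after"
```

A single archive of this kind is also a better match for the 1 MB block size than many files of a few bytes each.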
As with all production clusters at RRZE, resources are controlled through a batch system. The front ends can be used for compiling and very short serial test runs, but everything else has to go through the batch system to the cluster.
Meggie is RRZE’s first cluster to use SLURM as a batch system! For users of our older torque-based clusters, this means that the batch system commands have changed significantly.
Please see the batch system description for further details.
The following queues are available on this cluster:
|Partition||min – max walltime||min – max nodes||availability||Comments|
| ||0 – 01:00:00||1 – 8||all users||higher priority|
| ||0 – 24:00:00||1 – 64||all users||“Workhorse”|
| ||0 – 24:00:00||1 – 256||special users||Not active all the time as it causes quite some waste. Users can get access for benchmarking or after proving they can really make use of more than 64 nodes with their codes.|
| ||0 – infinity||1 – all||special users||only active during/after maintenance|
There is no routing queue! If you want to take advantage of the features of a partition other than the work partition, you have to explicitly specify this in your job script via
Eligible jobs in the work partition will automatically take advantage of the nodes reserved for short-running jobs.
As on all RRZE clusters, Intel MPI is recommended, but OpenMPI is available, too.
Because Meggie uses the SLURM scheduling system, there are two ways to start your MPI application. As on Emmy, you can use the native tools of the selected MPI as described here, i.e. call mpirun. Alternatively, you can use the generic SLURM way by calling srun with the option --mpi=pmi2. Both startup mechanisms can obtain parameters such as the number of nodes or processes per node directly from the SLURM scheduler, i.e. the parameters that you specified in the header of your job script. This means that you do not necessarily have to specify any parameters for srun. However, if you do, keep them consistent with the job script header to avoid unpredictable behavior.
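Putting the pieces together, a job script might look like the following. This is a minimal sketch, not an official RRZE template: the #SBATCH options are standard SLURM options, whereas the job name, the node count, the module name intelmpi and the binary ./my_mpi_app are placeholders chosen for illustration. The script is written to a file via a here-document so that it can be syntax-checked before submission.

```shell
# Write a hypothetical job script for the "work" partition to a file.
job="$(mktemp)"
cat > "$job" <<'EOF'
#!/bin/bash -l
# Job name (placeholder)
#SBATCH --job-name=example
# Partition: "work" is the default workhorse partition
#SBATCH --partition=work
# Number of full nodes and MPI processes per node (20 physical cores)
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=20
# Walltime limit (hh:mm:ss)
#SBATCH --time=01:00:00

# The module name is an assumption; check "module avail" for the real one.
module load intelmpi

# srun takes the node and task counts from the #SBATCH header above,
# so no further parameters are needed here.
srun --mpi=pmi2 ./my_mpi_app
EOF

# Check the script for shell syntax errors without running it.
bash -n "$job" && echo "syntax OK"

# Submit on a front end (not executed here):
#   sbatch "$job"
```

Leaving the srun line free of node and task counts keeps a single source of truth in the header, which avoids the inconsistencies mentioned above.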
Intel’s ARK lists some technical details about the Xeon E5-2630v4 processor.
|Clock speed||Base: 2.2 GHz, Turbo (1 core): 3.1 GHz, Turbo (all cores): 2.4 GHz|
|Number of cores||10 per socket|
|L1 cache||32 KiB per core (private)|
|L2 cache||256 KiB per core (private)|
|L3 cache||2.5 MiB per core (shared by all cores)|
|Peak performance @ base frequency||35.2 Gflop/s per core (16 flops/cy)|
|Supported SIMD extension||AVX2 with FMA|
|STREAM triad bandwidth per socket||53.5 Gbyte/s (standard stores; corrected for write-allocate transfers)|
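The peak performance figure in the table follows directly from the clock speed and the number of floating-point operations per cycle, and it can be scaled up to a node and to the whole cluster. The following sketch redoes this arithmetic with awk, using only numbers quoted on this page. The comparison with the measured LINPACK value is a rough estimate, since the actual clock frequency under load may differ from the base frequency.

```shell
# All input values are taken from this page.
ghz=2.2              # base clock in GHz
flops_per_cycle=16   # flops per cycle and core (AVX2 with FMA)
cores_per_node=20    # 2 sockets x 10 cores
nodes=728            # number of compute nodes
linpack_tflops=481   # measured LINPACK performance in TFlop/s

core_gflops=$(awk -v f="$ghz" -v c="$flops_per_cycle" 'BEGIN { printf "%.1f", f * c }')
node_gflops=$(awk -v g="$core_gflops" -v n="$cores_per_node" 'BEGIN { printf "%.0f", g * n }')
cluster_tflops=$(awk -v g="$node_gflops" -v n="$nodes" 'BEGIN { printf "%.1f", g * n / 1000 }')
efficiency=$(awk -v l="$linpack_tflops" -v p="$cluster_tflops" 'BEGIN { printf "%.1f", 100 * l / p }')

echo "per core:     $core_gflops Gflop/s"
echo "per node:     $node_gflops Gflop/s"
echo "cluster:      $cluster_tflops Tflop/s (peak at base frequency)"
echo "LINPACK/peak: $efficiency %"
```

Under these assumptions the cluster peak comes out at roughly 512.5 TFlop/s, so the measured LINPACK result corresponds to about 94% of the peak at base frequency.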
Omni-Path is essentially Intel’s proprietary implementation of “Infiniband”, created after Intel acquired the Infiniband part of QLogic. It shares most of the features and shortcomings of QLogic-based Infiniband networks.
Each node in Meggie has a 100 GBit Omni-Path card and is connected to a 100 GBit switch. However, the backbone of the network is not fully non-blocking: on each leaf switch, 32 of the 48 ports are used for compute nodes and 16 ports are used for the uplink, meaning there is a 1:2 blocking on the backbone.
As a result, if the nodes of your job are not all connected to the same switch, you may notice significant performance fluctuations due to the oversubscribed network. The batch system tries to place jobs on a single leaf switch if possible, but for obvious reasons that is not always possible, and for jobs using more than 32 nodes it is outright impossible.
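The consequences of this topology can be estimated with a little arithmetic. The following sketch uses the port counts given above. The job size of 48 nodes is an arbitrary example, and the bandwidth figure is a worst-case estimate that assumes all nodes on a leaf switch send across the backbone at the same time.

```shell
link_gbit=100      # bandwidth per link and direction in GBit/s
node_ports=32      # leaf-switch ports used for compute nodes
uplink_ports=16    # leaf-switch ports used for uplinks
job_nodes=48       # hypothetical job size

# Oversubscription: node ports per uplink port.
oversub=$(( node_ports / uplink_ports ))

# Worst-case bandwidth per node across the backbone.
worst_gbit=$(( uplink_ports * link_gbit / node_ports ))

# Minimum number of leaf switches the job has to span (rounded up).
min_switches=$(( (job_nodes + node_ports - 1) / node_ports ))

echo "oversubscription: ${oversub}:1"
echo "worst-case backbone bandwidth per node: ${worst_gbit} GBit/s"
echo "a ${job_nodes}-node job spans at least ${min_switches} leaf switches"
```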
Compared to the Mellanox IB cards in our other clusters, you will also notice that the Omni-Path stack is a horrible CPU hog. It can easily steal two whole cores, so if your job communicates a lot, it might help not to use all cores of a node.
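One way to leave cores free for the Omni-Path stack is to request fewer MPI processes per node than there are physical cores. The following sketch derives such a setting. Reserving two cores follows the estimate above, the node count is an arbitrary example, and whether this helps depends on the application, so it is worth benchmarking both variants.

```shell
cores_per_node=20   # 2 sockets x 10 cores
reserved_cores=2    # cores left idle for the Omni-Path stack (estimate)
nodes=4             # hypothetical job size

tasks_per_node=$(( cores_per_node - reserved_cores ))
total_tasks=$(( nodes * tasks_per_node ))

# Print the corresponding SLURM job script header lines.
echo "#SBATCH --nodes=${nodes}"
echo "#SBATCH --ntasks-per-node=${tasks_per_node}"
echo "# total MPI processes: ${total_tasks}"
```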