Job monitoring with ClusterCockpit

Introduction

System monitoring of clusters is a crucial task for system administrators, but the users of these systems are also interested in part of the collected metrics. For users, additional metrics such as floating-point performance or memory bandwidth are of interest as well. NHR@FAU has provided job-specific monitoring for its clusters for quite some time, but with the installation of the Fritz and Alex clusters, the whole system was rebuilt. The development is led by NHR@FAU, but other NHR centers also contribute by enhancing or simply using (and thereby testing) the framework. The whole stack is called ClusterCockpit and consists of multiple components:

  1. Node agent on each compute node: cc-metric-collector
  2. In-memory short-term and file-based long-term storage: cc-metric-store
  3. Web frontend with authentication for all users: cc-backend

Setup at NHR@FAU

The main point of access for the users is monitoring.nhr.fau.de. For authentication the HPC account is required, not the IDM account.

Integrated clusters:

Scope of user accounts

Users can only see their own jobs in the monitoring.

Different Views

ClusterCockpit provides different views of the systems depending on the scope of your account.

Job list

The job list contains all currently running jobs with job information like requested resources and a limited set of plots that give a first impression of the quality of a job.

If you click on the job ID on the left, the job-specific page with more information and plots is shown.

It takes a few minutes after a job starts before it appears in the list of running jobs.

User section

In the user section, each user can check the history of their jobs, including some statistics.

Tag section

Users can enrich the information of a job with tags, which are key/value pairs that describe the job. In the tag section, you can select tags and get a list of all jobs that carry the requested tags.
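
To illustrate how selecting by tags works conceptually, here is a minimal Python sketch. It is not ClusterCockpit code: the `Job` record, the `jobs_with_tags` helper, and the example tag names are assumptions made up for this illustration. It only shows the idea of a job carrying key/value tags and of listing the jobs that match every requested tag.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Job:
    """Hypothetical job record; not ClusterCockpit's actual data model."""
    job_id: int
    user: str
    tags: Dict[str, str] = field(default_factory=dict)


def jobs_with_tags(jobs: List[Job], wanted: Dict[str, str]) -> List[Job]:
    """Return the jobs that carry every requested key/value tag."""
    return [
        job for job in jobs
        if all(job.tags.get(key) == value for key, value in wanted.items())
    ]


# Example data with invented tag keys and values.
jobs = [
    Job(1001, "alice", {"app": "lammps", "testcase": "scaling"}),
    Job(1002, "alice", {"app": "gromacs"}),
    Job(1003, "alice", {"app": "lammps"}),
]

selected = jobs_with_tags(jobs, {"app": "lammps"})
print([job.job_id for job in selected])
```

In this sketch a job must match all requested tags, so adding a second tag to the selection narrows the result. Whether the web interface combines multiple tags in the same way is not stated here and should be checked in the interface itself.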

Reporting problems with ClusterCockpit

If you have problems with the setup at NHR@FAU, please contact the general support address hpc-support@fau.de.

For general questions about ClusterCockpit and its development, there are two separate Matrix chats:

Dr. Jan Eitzinger

Head of Software & Tools

Erlangen National High Performance Computing Center
Software & Tools

Thomas Gruber

Development LIKWID, ClusterCockpit & MachineState

Erlangen National High Performance Computing Center
Software & Tools Division