Job monitoring with ClusterCockpit
System monitoring of cluster systems is a crucial task for system administrators, but the users of these systems are also interested in part of the collected metrics. Users additionally care about metrics such as floating-point performance or memory bandwidth. NHR@FAU has provided job-specific monitoring for its clusters for quite some time, but with the installation of the Fritz and Alex clusters, the whole system was re-created. Development is led by NHR@FAU, with other NHR centers contributing by enhancing or simply using (and thereby testing) the framework. The whole stack is called ClusterCockpit and consists of multiple components:
- Node agent on each compute node: cc-metric-collector
- In-memory short-term and file-based long-term storage: cc-metric-store
- Web frontend with authentication for all users: cc-backend
Setup at NHR@FAU
The main point of access for the users is monitoring.nhr.fau.de. For authentication the HPC account is required, not the IDM account.
Scope of user accounts
Users can only see their own jobs in the monitoring.
ClusterCockpit provides different views of the systems depending on the scope of your account.
The job list contains all currently running jobs with job information like requested resources and a limited set of plots that give a first impression of the quality of a job.
If you click on the job id on the left, the job-specific page with more information and plots is shown.
It can take a few minutes after a job starts before it appears in the list of running jobs.
In the user section, each user can check the history of their jobs, including some statistics.
Users can enrich the information of a job with tags, each a key/value pair describing the job. In the tag section, you can select tags and get a list of all jobs with the requested tags.
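The tag selection can be pictured with a small sketch. It is only an illustration of filtering by key/value tags, not the ClusterCockpit implementation: the `Job` record, the `jobs_with_tags` helper, and the example tag names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Job:
    """Hypothetical job record; real ClusterCockpit jobs carry far more fields."""
    job_id: int
    user: str
    tags: Dict[str, str] = field(default_factory=dict)


def jobs_with_tags(jobs: List[Job], wanted: Dict[str, str]) -> List[Job]:
    """Return the jobs that carry every requested key/value tag."""
    return [
        job for job in jobs
        if all(job.tags.get(key) == value for key, value in wanted.items())
    ]


# Made-up example data: three jobs of one user with different tags.
jobs = [
    Job(1001, "alice", {"project": "cfd", "quality": "good"}),
    Job(1002, "alice", {"project": "cfd"}),
    Job(1003, "alice", {"project": "md"}),
]

selected = jobs_with_tags(jobs, {"project": "cfd"})
print([job.job_id for job in selected])  # prints [1001, 1002]
```

Requesting several tags at once narrows the result: in this sketch, `{"project": "cfd", "quality": "good"}` matches only job 1001.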
Reporting problems with ClusterCockpit
If you have problems with the setup at NHR@FAU, please contact the common support at firstname.lastname@example.org.
For general questions about ClusterCockpit and its development, there are two separate Matrix chats: