hpc-ch Forum on monitoring and reporting in HPC

Dear members and guests of the hpc-ch community,

We are pleased to invite you to the upcoming hpc-ch forum on

Monitoring and reporting in HPC

to be held on Thursday, October 23rd, 2014, from 9:30 until 16:45, kindly hosted by the University of Bern.

Please register via the website by October 10th, 2014, and let me know via e-mail if you’re willing to contribute a short presentation (20 minutes including Q&A).

Running an HPC system is like commanding a starship: we sit on the control bridge and have to watch different displays, monitors, radars and alerts to understand the general health of the ship and make sure we can continue our journey and arrive safely at our destination. Hundreds of components contribute to the functioning of the whole system; alerts are raised at different levels by hardware and software components; and early detection and diagnostic tools could help us reduce downtime and ensure robust operations.

This hpc-ch Forum will give us the opportunity to discuss different aspects of monitoring and reporting in the HPC field.

Key Questions

  • Which elements of our HPC systems does it make sense to monitor? (CPU, RAM, Ethernet & InfiniBand network, I/O interfaces, services, end-to-end cases, storage, middleware, …)
  • What tools are available to collect, analyze and display this information? What information do we make available to our users and to the public? Can ITIL Service Design concepts offer some guidance for ensuring robust HPC services?
  • What are the best practices for gathering error logs and warning messages? What are efficient mechanisms for generating alerts? What escalation levels do we support? How can we react at night, on weekends and during longer holidays?
  • How can monitoring improve the security of our systems? (NetFlow statistics, …)
  • What kind of reports do we provide routinely (current load, available resources, …)? Which reports are relevant to whom?
  • Which tools and techniques are widely used for statistical analysis of system logs? What strategies are used to correlate hardware and application errors? In the RFP and procurement of an HPC system, are there typical requirements or metrics for monitoring and reporting needs?

Location

Universität Bern, Campus von Roll
Fab6 Hörsaalgebäude, Room 104-6
Fabrikstrasse 6
3012 Bern

Chairmanship

  • Nina Mujkanovic, University of Bern
  • Michael Rolli, University of Bern
  • Michele De Lorenzi, hpc-ch / CSCS