Call for Participation: Forum on Scheduler and Queuing Systems
We are pleased to invite you to the upcoming hpc-ch forum on
Scheduler and Queuing Systems
to be held on Wednesday 18 May 2011, 9:45 – 17:00
kindly hosted by PSI.
Setting the Scope
Schedulers and queuing systems are an essential component of every HPC system. Their overall goal is easy to state: distribute jobs fairly across the nodes of a supercomputer according to the centre's policies while making the best use of the available resources. The implementation, however, is more complex than it may appear. As with any optimization problem there is no single solution; different heuristics are more or less appropriate depending on the criteria. The optimization is further complicated by the increasing heterogeneity of the resources: future supercomputers will possibly be composed of a mix of CPUs and GPUs sharing a common interconnect. What will the allocation algorithms look like then? Finally, this is the field of supercomputing where psychological aspects matter most: do users perceive the queuing system as fair and predictable? Which user behavior is rewarded, and which is punished? In a globalized world of standardized technologies, the algorithms and policies behind schedulers and queuing systems may be the last possible differentiation between competing data centres. Are we really ready to share this knowledge with our colleagues?
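To make the point about competing heuristics a little more concrete, here is a minimal, purely illustrative sketch of a fair-share priority rule: jobs gain priority the longer they wait and lose priority the more their group has already consumed relative to its allocated share. The job fields, weights, and numbers below are invented for illustration and do not describe the algorithm of any particular scheduler; real systems combine many more factors.

```python
# Toy fair-share priority sketch (hypothetical, for illustration only).
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    wait_hours: float    # time the job has spent waiting in the queue
    group_usage: float   # CPU-hours recently consumed by the job's group
    group_share: float   # CPU-hours the group is entitled to under centre policy

def priority(job: Job, w_wait: float = 1.0, w_fair: float = 10.0) -> float:
    """Higher value = dispatched earlier; the weights encode centre policy."""
    fair_share = max(0.0, 1.0 - job.group_usage / job.group_share)
    return w_wait * job.wait_hours + w_fair * fair_share

queue = [
    Job("climate_run", wait_hours=12, group_usage=900, group_share=1000),
    Job("md_sim",      wait_hours=2,  group_usage=100, group_share=1000),
]
for job in sorted(queue, key=priority, reverse=True):
    print(f"{job.name}: priority {priority(job):.2f}")
```

Even in this toy example, changing the two weights changes which job runs first, which is exactly where centre policy, fairness perception, and user behavior come into play.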
Key Questions
- What are your usage policies, and how do you translate them into allocation algorithms? What about over-allocation, overuse, underuse, discounts, etc.?
- What scheduler and queuing systems are you using? Why did you choose this particular system?
- What is most important to you: price, performance, reliability, support?
- What are the costs of scheduler and queuing systems? Are open source products really cheaper than proprietary ones? What are your experiences?
- What would be suitable metrics for measuring the performance of queuing systems?
- What are your experiences with the integration of heterogeneous systems (e.g. GPUs, nodes of different sizes)?
- Dispatching jobs on the same system from a single core up to very large parallel jobs: nice to have, better not to have, or does it not matter?
- How much should be hidden from, and how much shown to, the end user? Is it better to have a single generic queue or several specialized queues?
- How do you interpret usage accounting and priority settings?
- Do you have any particular experiences you want to share?