Talk at USI in Lugano – Performance Engineering for HPC: Models generating insights
Speaker: Gerhard Wellein, University of Erlangen-Nuremberg, Germany
Wednesday, March 29, 2017, 13:30
USI Lugano Campus, room SI-006, informatics building (Via G. Buffi 13)
Abstract
We consider Performance Engineering (PE) as a structured, iterative process for code optimization and parallelization. The key ingredient is a white-box performance model which provides insight into the interaction between the code and the hardware. The model identifies the actual performance-limiting factors (“bottlenecks”), allowing for a selection of appropriate code changes. Once the impact of the code changes has been validated, the process restarts with a new bottleneck identified by the performance model. Since this model-based approach provides a thorough understanding of the impact of hardware features on code performance, it is also useful in various other areas such as performance reproducibility, performance prediction for future architectures, or education and training.

The talk will first introduce our PE concept and survey basic “white-box” performance models. Focusing on work performed in the “Equipping Sparse Solvers for Exascale” (ESSEX) project, we will demonstrate various aspects of PE in the context of sparse eigenvalue solvers for quantum physics applications. Here a thorough understanding of modern hardware concepts led to the proposal of a new sparse matrix data format, which delivers high performance for many matrix structures on all modern HPC compute devices (multicore CPUs, Intel Xeon Phi, Nvidia GPGPUs). The benefit of using (simple) analytic models in performance optimization is demonstrated for a Kernel Polynomial Method (KPM) based solver, which computes the spectral density of large sparse matrices. By designing specific kernel operations and applying blocking on interleaved vectors, this sparse KPM solver has been accelerated by a factor of 3-4 on the node level, delivering about 10% of peak performance on various generations of modern Intel CPUs and Nvidia GPGPUs. These improvements finally enabled us to achieve sustained performance in the PetaFlop/s range for large-scale (heterogeneous) KPM calculations on sparse matrices from quantum physics applications.
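As an illustration of what such an analytic “white-box” model looks like, the Roofline model is a commonly used example: it bounds the attainable performance of a loop kernel by P = min(P_peak, I * b_s), where P_peak is the machine peak floating-point rate, b_s the attainable memory bandwidth, and I the computational intensity of the kernel in Flop/Byte. A minimal sketch in Python follows; all machine and kernel numbers are assumptions chosen purely for illustration, not values from the talk.

def roofline_performance(peak_flops, mem_bandwidth, intensity):
    """Roofline bound P = min(P_peak, I * b_s).

    peak_flops    -- machine peak floating-point rate [GFlop/s]
    mem_bandwidth -- attainable main memory bandwidth b_s [GByte/s]
    intensity     -- computational intensity I of the kernel [Flop/Byte]
    """
    return min(peak_flops, intensity * mem_bandwidth)

# Illustrative numbers only: a sparse matrix-vector kernel with an assumed
# intensity of 0.17 Flop/Byte on a hypothetical CPU with 1000 GFlop/s peak
# and 100 GByte/s memory bandwidth is bandwidth-bound at about 17 GFlop/s.
print(roofline_performance(peak_flops=1000.0, mem_bandwidth=100.0, intensity=0.17))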
Acknowledgment
This work was supported by the German Research Foundation (DFG) through Priority Programme 1648 “Software for Exascale Computing” under project ESSEX (see https://blogs.fau.de/essex/).