CP2K Benchmark Suite

Introduction

The purpose of the CP2K benchmark suite is to provide performance data which can be used to guide users towards the best configuration (e.g. machine, number of MPI processes, number of OpenMP threads) for a particular problem, and to give a good estimate of the parallel performance of the code for different types of method. Five benchmarks are provided: H2O-64, Fayalite-FIST, LiH-HFX, H2O-DFT-LS and H2O-64-RI-MP2. Descriptions of each benchmark along with performance figures are below.

We encourage you to contribute benchmark results from your own local cluster or HPC system - just run the inputs and add timings in the relevant sections below. Python scripts for generating the scaling graphs are provided in tools/benchmark_plots/. Please also update the list of machines for which benchmark data is provided.

If you have any questions or problems running benchmarks or using the scripts please contact Iain Bethune (ibethune@epcc.ed.ac.uk).

Notes on Results

Some benchmarks perform MD, whilst the more expensive methods perform only a single-point energy computation. The total time for the calculation is therefore reported against the number of compute nodes used. Each benchmark uses a different system, so results are not directly comparable between benchmarks.

The mixed mode MPI/OpenMP version of CP2K is used to measure performance (there is negligible overhead from running this version with 1 thread per process compared to the pure MPI code). For a fixed number of cores, all reasonable combinations of MPI processes and OpenMP threads were tested, subject to keeping each process's threads within a single NUMA region. For example on ARCHER, 6 cores share a single NUMA region, so no more than 6 threads per process were used, since using more would result in very poor performance. From these combinations, the best run time and the corresponding number of threads per process are reported. As most HPC systems charge by the node, full nodes were utilised at all times.
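The selection rule above can be sketched in a few lines of Python. This is only an illustration, not the script used to produce the results: the function name `hybrid_layouts`, the example node size (32 cores) and the example NUMA region size (8 cores) are hypothetical, and the sketch additionally assumes that the thread count divides the NUMA region evenly, which is one simple way to guarantee that no process straddles two regions.

```python
def hybrid_layouts(total_cores, cores_per_node, cores_per_numa):
    """Return (mpi_processes, omp_threads) pairs that use full nodes and
    keep all threads of a process inside a single NUMA region."""
    if total_cores % cores_per_node != 0:
        raise ValueError("total_cores must correspond to a whole number of nodes")
    if cores_per_node % cores_per_numa != 0:
        raise ValueError("cores_per_node must be a multiple of cores_per_numa")
    layouts = []
    for threads in range(1, cores_per_numa + 1):
        # Assumption: only thread counts that tile a NUMA region exactly are kept.
        if cores_per_numa % threads == 0:
            layouts.append((total_cores // threads, threads))
    return layouts


# Hypothetical machine: 32 cores per node, 8 cores per NUMA region, 512 cores in total.
for processes, threads in hybrid_layouts(512, 32, 8):
    print(f"{processes:4d} MPI processes x {threads} OpenMP threads")
```

Each layout would then be timed, and the fastest one reported together with its thread count.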

The systems used to obtain the benchmark results are described on the systems page.

Benchmarks

H2O-64

Description

Ab-initio molecular dynamics of liquid water using the Born-Oppenheimer approach, using Quickstep DFT. Production quality settings for the basis sets (TZV2P) and the planewave cutoff (280 Ry) are chosen, and the Local Density Approximation (LDA) is used for the calculation of the Exchange-Correlation energy. The configurations were generated by classical equilibration, and the initial guess of the electronic density is made based on Atomic Orbitals. The system contains 64 water molecules (192 atoms, 512 electrons) in a cubic cell of side 12.4 Å, and MD is run for 10 steps.
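As a quick sanity check on the system size quoted above, the following sketch recomputes the atom count, the electron count and the mass density. It assumes a cubic cell with a 12.4 Å edge and that only valence electrons are counted (8 per water molecule, as is usual with pseudopotentials); the function and constant names are illustrative only.

```python
AVOGADRO = 6.02214076e23      # molecules per mole
WATER_MOLAR_MASS = 18.015     # g/mol
ANGSTROM3_TO_CM3 = 1.0e-24    # volume of 1 cubic angstrom in cm^3


def water_density(n_molecules, cell_edge_angstrom):
    """Mass density in g/cm^3 of n water molecules in a cubic cell."""
    mass_g = n_molecules * WATER_MOLAR_MASS / AVOGADRO
    volume_cm3 = cell_edge_angstrom ** 3 * ANGSTROM3_TO_CM3
    return mass_g / volume_cm3


n_molecules = 64
print("atoms:", n_molecules * 3)               # 2 H + 1 O per molecule
print("valence electrons:", n_molecules * 8)   # assumption: valence electrons only
print(f"density: {water_density(n_molecules, 12.4):.3f} g/cm^3")
```

The density comes out close to 1 g/cm³, consistent with liquid water at ambient conditions.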

Availability

The benchmark is available (along with other water systems) from the CP2K source distribution: tests/QS/benchmark/

Results

The best configurations are shown below. Click the links to see more detail.

Machine Name	Architecture	Date	SVN Revision	Fastest time (s)	Configuration		Detailed results
HECToR	Cray XE6	21/1/2014	13196	39.066	512 cores	2 OMP threads per MPI task	hector-h2o-64
ARCHER	Cray XC30	8/1/2014	13473	18.11	576 cores	1 OMP thread per MPI task	archer-h2o-64
Magnus	Cray XC40	22/10/2014	14377	17.275	384 cores	1 OMP thread per MPI task	magnus-h2o-64
Piz Daint	Cray XC30	12/05/2015	15268	19.885	192 cores	1 OMP thread per MPI task, no GPU	piz-daint-h2o-64
Cirrus	SGI ICE XA	24/11/2016	17566	15.560	1152 cores	9 OMP threads per MPI task	cirrus-h2o-64
Noctua	Cray CS500	27/04/2019	3cf5f249	16.5	320 cores	1 OMP thread per MPI task	noctua-h2o-64
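Because the fastest configurations use very different core counts, it can help to look at the resource cost of each run as well as its wall time. The sketch below multiplies the fastest time by the number of cores for each row of the H2O-64 table; the helper name `core_seconds` is illustrative. Cores on different machines are not equally fast, so this is a rough indication of cost rather than a like-for-like comparison.

```python
# (machine, fastest time in s, cores) copied from the H2O-64 results table.
H2O_64_BEST = [
    ("HECToR", 39.066, 512),
    ("ARCHER", 18.11, 576),
    ("Magnus", 17.275, 384),
    ("Piz Daint", 19.885, 192),
    ("Cirrus", 15.560, 1152),
    ("Noctua", 16.5, 320),
]


def core_seconds(time_s, cores):
    """Resource cost of a run: wall time multiplied by cores used."""
    return time_s * cores


# List the runs from cheapest to most expensive in core-seconds.
for machine, time_s, cores in sorted(H2O_64_BEST, key=lambda row: core_seconds(row[1], row[2])):
    cost = core_seconds(time_s, cores)
    print(f"{machine:10s} {time_s:8.3f} s on {cores:5d} cores = {cost:9.1f} core-seconds")
```

The fastest time in each row is not necessarily the cheapest run on that machine, since only the best wall time is reported.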

Fayalite-FIST

Description

This is a short molecular dynamics run of 1000 time steps in an NPT ensemble at 300 K. It consists of 28000 atoms: a 10³ supercell with 28 atoms of iron silicate (Fe₂SiO₄, also known as Fayalite) per unit cell. The simulation employs a classical potential (Morse with a hard-core repulsive term and 5.5 Å cutoff) with long-range electrostatics using Smoothed Particle Mesh Ewald (SPME) summation. While CP2K does support classical potentials via the Frontiers In Simulation Technology (FIST) module, this is not a typical calculation for CP2K; it is included to give an impression of the performance difference between machines for the MM part of a QM/MM calculation.

Availability

The benchmark is available from the CP2K source distribution: tests/Fist/benchmark/

Results

The best configurations are shown below. Click the links to see more detail.

Machine Name	Architecture	Date	SVN Revision	Fastest time (s)	Configuration		Detailed results
HECToR	Cray XE6	21/1/2014	13196	403.928	2048 cores	4 OMP threads per MPI task	hector-fayalite-fist
ARCHER	Cray XC30	9/1/2014	13473	197.117	576 cores	6 OMP threads per MPI task	archer-fayalite-fist
Magnus	Cray XC40	6/11/2014	14377	150.493	768 cores	6 OMP threads per MPI task	magnus-fayalite-fist
Piz Daint	Cray XC30	12/05/2015	15268	207.972	512 cores	2 OMP threads per MPI task, no GPU	piz-daint-fayalite-fist
Cirrus	SGI ICE XA	24/11/2016	17566	166.192	576 cores	2 OMP threads per MPI task	cirrus-fayalite-fist
Noctua	Cray CS500	27/04/2019	3cf5f249	139.177	320 cores	1 OMP thread per MPI task	noctua-fayalite-fist

LiH-HFX

Description

This is a single-point energy calculation using Quickstep GAPW (Gaussian and Augmented Plane-Waves) with hybrid Hartree-Fock exchange. It consists of a 216 atom Lithium Hydride crystal with 432 electrons in a cubic cell of side 12.3 Å. These types of calculations are generally around one hundred times the computational cost of a standard local DFT calculation, although this can be reduced using the Auxiliary Density Matrix Method (ADMM). Using OpenMP is of particular benefit here as the HFX implementation requires a large amount of memory to store partial integrals. By using several threads, fewer MPI processes share the available memory on the node, so each process has enough memory to avoid recomputing any integrals on-the-fly, which improves performance.
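The memory argument above is simple arithmetic, sketched here for a hypothetical node with 24 cores and 64 GB of memory (neither figure is taken from the benchmark systems, and the function name is illustrative). With full nodes in use, the number of MPI processes per node is the core count divided by the threads per process, and the node memory is shared between those processes.

```python
def memory_per_process_gb(node_memory_gb, cores_per_node, threads_per_process):
    """Memory available to each MPI process when a full node is used."""
    if cores_per_node % threads_per_process != 0:
        raise ValueError("threads_per_process must divide cores_per_node")
    processes_per_node = cores_per_node // threads_per_process
    return node_memory_gb / processes_per_node


# Hypothetical node: 24 cores and 64 GB of memory.
for threads in (1, 2, 3, 6):
    mem = memory_per_process_gb(64.0, 24, threads)
    print(f"{threads} threads per process -> {24 // threads:2d} processes per node, {mem:5.2f} GB each")
```

Going from 1 to 6 threads per process in this example raises the memory available to each process sixfold, which is the headroom that lets integrals be stored rather than recomputed.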

Availability

The benchmark is available from tests/QS/benchmark_HFX/LiH/.

Results

The best configurations are shown below. Click the links to see more detail.

Machine Name	Architecture	Date	SVN Revision	Fastest time (s)	Configuration		Detailed results
HECToR	Cray XE6	21/1/2014	13196(*)	121.362	65536 cores	8 OMP threads per MPI task	hector-lih-hfx
ARCHER	Cray XC30	9/1/2014	13473(*)	51.172	49152 cores	6 OMP threads per MPI task	archer-lih-hfx
Magnus	Cray XC40	10/11/2014	14377(*)	62.075	24576 cores	4 OMP threads per MPI task	magnus-lih-hfx
Piz Daint	Cray XC30	12/05/2015	15268	66.051	32768 cores	4 OMP threads per MPI task, no GPU	piz-daint-lih-hfx
Cirrus	SGI ICE XA	24/11/2016	17566	483.676	2016 cores	6 OMP threads per MPI task	cirrus-lih-hfx
Noctua	Cray CS500	27/04/2019	3cf5f249	203.092	5120 cores	1 OMP thread per MPI task	noctua-lih-hfx

(*) Prior to r14945, a bug resulted in an underestimation of the number of ERIs which should be computed (by roughly 50% for this benchmark). Therefore these results cannot be compared directly with later ones.

H2O-DFT-LS

Description

This is a single-point energy calculation using linear-scaling DFT. It consists of 6144 atoms in a cubic box of side 39 Å (2048 water molecules in total). An LDA functional is used with a DZVP MOLOPT basis set and a 300 Ry cut-off. For large systems the linear-scaling approach for solving the Self-Consistent-Field equations will be much cheaper computationally than using standard DFT and allows scaling up to 1 million atoms for simple systems. The linear scaling cost results from the fact that the algorithm is based on an iteration on the density matrix. The cubically-scaling orthogonalisation step of standard Quickstep DFT using OT is avoided and the key operation is sparse matrix-matrix multiplication, where the number of non-zero entries in the matrices scales linearly with system size. These multiplications are implemented efficiently in the DBCSR library.
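To illustrate why the number of non-zero entries grows linearly, consider a toy model rather than the DBCSR data structures: a one-dimensional chain of sites in which matrix element (i, j) is non-zero only when the two sites are within a fixed range of each other. The function name and the chosen range of 3 neighbours are assumptions made for this sketch.

```python
def nonzeros_in_chain(n_sites, neighbour_range):
    """Count the non-zero entries of an n x n matrix in which element (i, j)
    is non-zero only if |i - j| <= neighbour_range."""
    count = 0
    for i in range(n_sites):
        first = max(0, i - neighbour_range)
        last = min(n_sites - 1, i + neighbour_range)
        count += last - first + 1
    return count


# Doubling the chain length roughly doubles the non-zeros,
# while the dense matrix size quadruples.
for n in (100, 200, 400, 800):
    nnz = nonzeros_in_chain(n, 3)
    print(f"n={n:4d}  non-zeros={nnz:5d}  dense entries={n * n:7d}  fill={nnz / (n * n):.4f}")
```

In a real calculation the sparsity pattern is three-dimensional and depends on the basis set and filtering threshold, but the same locality argument applies.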

Availability

The benchmark input file used to generate these results is available here.

It is a slightly modified version of the more general one in the CP2K SVN at tests/QS/benchmark_DM_LS/H2O-dft-ls.inp, where the problem size can be tuned by a parameter NREP.

Results

The best configurations are shown below. Click the links to see more detail.

Machine Name	Architecture	Date	SVN Revision	Fastest time (s)	Configuration		Detailed results
HECToR	Cray XE6	16/1/2014	13196	98.256	65536 cores	8 OMP threads per MPI task	hector-h2o-dft-ls
ARCHER	Cray XC30	8/1/2014	13473	28.476	49152 cores	4 OMP threads per MPI task	archer-h2o-dft-ls
Magnus	Cray XC40	3/12/2014	14377	30.921	24576 cores	2 OMP threads per MPI task	magnus-h2o-dft-ls
Piz Daint	Cray XC30	12/05/2015	15268	27.900	32768 cores	2 OMP threads per MPI task, no GPU	piz-daint-h2o-dft-ls
Cirrus	SGI ICE XA	24/11/2016	17566	543.032	2016 cores	2 OMP threads per MPI task	cirrus-h2o-dft-ls
Noctua	Cray CS500	27/04/2019	3cf5f249	77.413	5120 cores	1 OMP thread per MPI task	noctua-h2o-dft-ls

H2O-64-RI-MP2

Description

This benchmark is a single-point energy calculation using 2nd order Møller-Plesset perturbation theory (MP2) with the Resolution-of-the-Identity approximation to calculate the exchange-correlation energy. The system consists of 64 water molecules in a cubic cell of side 12.4 Å. This is exactly the same system as used by H2O-64 but with a much more accurate model, which is around 100 times more computationally demanding than standard DFT calculations.

Availability

The benchmark is in the CP2K SVN at: tests/QS/benchmark_mp2_rpa/64-H2O/.

Results

The best configurations are shown below. Click the links to see more detail.

Machine Name	Architecture	Date	SVN Revision	Fastest time (s)	Configuration		Detailed results
HECToR	Cray XE6	13/1/2014	13196	141.633	49152 cores	8 OMP threads per MPI task	hector-h2o-64-ri-mp2
ARCHER	Cray XC30	9/1/2014	13473	83.945	36864 cores	4 OMP threads per MPI task	archer-h2o-64-ri-mp2
Magnus	Cray XC40	4/11/2014	14377	63.891	24576 cores	6 OMP threads per MPI task	magnus-h2o-64-ri-mp2
Piz Daint	Cray XC30	12/05/2015	15268	48.15	32768 cores	8 OMP threads per MPI task, no GPU	piz-daint-h2o-64-ri-mp2
Cirrus	SGI ICE XA	24/11/2016	17566	303.571	2016 cores	1 OMP thread per MPI task	cirrus-h2o-64-ri-mp2
Noctua	Cray CS500	27/04/2019	3cf5f249	101.617	5120 cores	1 OMP thread per MPI task	noctua-h2o-64-ri-mp2