====== CP2K Benchmark Suite ======

===== Introduction =====

The purpose of the CP2K benchmark suite is to provide performance data which can be used to guide users towards the best configuration (e.g. machine, number of MPI processes, number of OpenMP threads) for a particular problem, and to give a good estimate of the parallel performance of the code for different types of method. Five benchmarks are provided: ''H2O-64'', ''Fayalite-FIST'', ''LiH-HFX'', ''H2O-DFT-LS'' and ''H2O-64-RI-MP2''. Descriptions of each benchmark along with performance figures are given below.

We encourage you to contribute benchmark results from your own local cluster or HPC system - just run the inputs and add timings in the relevant sections below. Python scripts for generating the scaling graphs are provided in [[src>tools/benchmark_plots/]]. Please also update the [[performance:systems|list of machines]] for which benchmark data is provided. If you have any questions or problems running the benchmarks or using the scripts, please contact Iain Bethune.
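For reference, a scaling graph of the kind shown on the detailed results pages can be produced along the following lines. This is a minimal sketch rather than one of the actual scripts in [[src>tools/benchmark_plots/]], and it assumes a hypothetical two-column results file (core count and run time per row):

<code python>
#!/usr/bin/env python3
# Minimal sketch: plot run time against core count on log-log axes.
# "h2o-64.txt" is a hypothetical whitespace-separated results file with
# one "cores time_in_seconds" pair per line.
import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt("h2o-64.txt")
cores, time = data[:, 0], data[:, 1]

fig, ax = plt.subplots()
ax.loglog(cores, time, "o-", label="measured")
# Ideal scaling through the first data point: time halves when cores double.
ax.loglog(cores, time[0] * cores[0] / cores, "k--", label="ideal")
ax.set_xlabel("Number of cores")
ax.set_ylabel("Time (s)")
ax.legend()
fig.savefig("h2o-64-scaling.png")
</code>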
===== Notes on Results =====

Some benchmarks perform MD, whilst the more expensive methods run only a single-point energy calculation, so the total time for the calculation is reported against the number of compute nodes used. Each benchmark uses a different system, so results are not directly comparable between benchmarks.

The mixed-mode MPI/OpenMP version of CP2K is used to measure performance (there is negligible overhead from running this version with 1 thread per process compared to the pure MPI code). For a fixed number of cores, all reasonable combinations of MPI processes and OpenMP threads were tested, subject to keeping each process's threads within a single NUMA region. For example, on ARCHER 6 cores share a single NUMA region, so no more than 6 threads per process were used, as spanning NUMA regions would give very poor performance. From these combinations, the best run time and the corresponding number of threads per process are reported. As most HPC systems charge by the node, full nodes were utilised at all times.

The systems used to obtain the benchmark results are described on the [[performance:systems|systems page]].
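As an illustration of how the tested combinations are chosen, the sketch below enumerates the (MPI processes × OpenMP threads) pairs for a fixed core count under the NUMA constraint described above. The node and NUMA region sizes are example values only, not a description of any particular machine:

<code python>
# Sketch: enumerate MPI x OpenMP combinations that fill a given number of
# cores while keeping each process's threads inside one NUMA region.
# The 24-core node with 6-core NUMA regions is a hypothetical example.
CORES_PER_NODE = 24
NUMA_REGION_SIZE = 6

def combinations(total_cores):
    """Yield (mpi_processes, omp_threads) pairs that use every core."""
    for threads in range(1, NUMA_REGION_SIZE + 1):
        if total_cores % threads == 0:
            yield total_cores // threads, threads

for ranks, threads in combinations(2 * CORES_PER_NODE):  # two full nodes
    print(f"{ranks:3d} MPI processes x {threads} OpenMP threads")
</code>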
===== Benchmarks =====

==== H2O-64 ====

=== Description ===

//Ab-initio// molecular dynamics of liquid water using the Born-Oppenheimer approach and [[Quickstep]] DFT. Production-quality settings for the basis sets (TZV2P) and the planewave cutoff (280 Ry) are chosen, and the Local Density Approximation (LDA) is used for the calculation of the exchange-correlation energy. The configurations were generated by classical equilibration, and the initial guess of the electronic density is based on atomic orbitals. The system contains 64 water molecules (192 atoms, 512 electrons) in a 12.4 Å<sup>3</sup> cell, and MD is run for 10 steps.

=== Availability ===

The benchmark is available (along with other water systems) from the CP2K source distribution: [[src>benchmarks/QS/]]

=== Results ===

The best configurations are shown below. Click the links to see more detail.

^ Machine Name ^ Architecture ^ Date ^ Git Commit ^ Fastest time (s) ^ Configuration ^^ Detailed results ^
| HECToR | Cray XE6 | 21/01/2014 | [[commit>82b8204]] | 39.066 | 512 cores | 2 OMP threads per MPI task | [[performance:hector-h2o-64]] |
| ARCHER | Cray XC30 | 08/01/2014 | [[commit>292a983]] | 18.11 | 576 cores | 1 OMP thread per MPI task | [[performance:archer-h2o-64]] |
| Magnus | Cray XC40 | 22/10/2014 | [[commit>27eacee]] | 17.275 | 384 cores | 1 OMP thread per MPI task | [[performance:magnus-h2o-64]] |
| Piz Daint | Cray XC30 | 12/05/2015 | [[commit>f439118]] | 19.885 | 192 cores | 1 OMP thread per MPI task, no GPU | [[performance:piz-daint-h2o-64]] |
| Cirrus | SGI ICE XA | 24/11/2016 | [[commit>989a92c]] | 15.560 | 1152 cores | 9 OMP threads per MPI task | [[performance:cirrus-h2o-64]] |
| Noctua | Cray CS500 | 25/09/2019 | [[commit>9f58d81]] | 13.3 | 640 cores | 10 OMP threads per MPI task | [[performance:noctua-h2o-64]] |

==== Fayalite-FIST ====

=== Description ===

This is a short molecular dynamics run of 1000 time steps in an NPT ensemble at 300 K. It consists of 28000 atoms - a 10<sup>3</sup> supercell with 28 atoms of iron silicate (Fe<sub>2</sub>SiO<sub>4</sub>, also known as fayalite) per unit cell. The simulation employs a classical potential (Morse with a hard-core repulsive term and a 5.5 Å cutoff) with long-range electrostatics treated by Smoothed Particle Mesh Ewald (SPME) summation. While CP2K does support classical potentials via the Frontiers In Simulation Technology (FIST) module, this is not a typical calculation for CP2K; it is included to give an impression of the performance difference between machines for the MM part of a QM/MM calculation.

=== Availability ===

The benchmark is available from the CP2K source distribution: [[src>benchmarks/Fist/]]

=== Results ===

The best configurations are shown below. Click the links to see more detail.

^ Machine Name ^ Architecture ^ Date ^ Git Commit ^ Fastest time (s) ^ Configuration ^^ Detailed results ^
| HECToR | Cray XE6 | 21/01/2014 | [[commit>82b8204]] | 403.928 | 2048 cores | 4 OMP threads per MPI task | [[performance:hector-fayalite-fist]] |
| ARCHER | Cray XC30 | 09/01/2014 | [[commit>292a983]] | 197.117 | 576 cores | 6 OMP threads per MPI task | [[performance:archer-fayalite-fist]] |
| Magnus | Cray XC40 | 06/11/2014 | [[commit>27eacee]] | 150.493 | 768 cores | 6 OMP threads per MPI task | [[performance:magnus-fayalite-fist]] |
| Piz Daint | Cray XC30 | 12/05/2015 | [[commit>f439118]] | 207.972 | 512 cores | 2 OMP threads per MPI task, no GPU | [[performance:piz-daint-fayalite-fist]] |
| Cirrus | SGI ICE XA | 24/11/2016 | [[commit>989a92c]] | 166.192 | 576 cores | 2 OMP threads per MPI task | [[performance:cirrus-fayalite-fist]] |
| Noctua | Cray CS500 | 25/09/2019 | [[commit>9f58d81]] | 119.820 | 2560 cores | 10 OMP threads per MPI task | [[performance:noctua-fayalite-fist]] |

==== LiH-HFX ====

=== Description ===

This is a single-point energy calculation using [[Quickstep]] GAPW (Gaussian and Augmented Plane-Waves) with hybrid Hartree-Fock exchange. It consists of a 216-atom lithium hydride crystal with 432 electrons in a 12.3 Å<sup>3</sup> cell. These types of calculation are generally around one hundred times more computationally expensive than a standard local DFT calculation, although this can be reduced using the Auxiliary Density Matrix Method (ADMM). Using OpenMP is of particular benefit here, as the HFX implementation requires a large amount of memory to store partial integrals. By using several threads, fewer MPI processes share the available memory on the node, so enough memory is available per process to avoid recomputing any integrals on the fly, which improves performance.
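A back-of-the-envelope sketch of this trade-off is shown below; the node memory and core count are hypothetical example values, not measurements from any of the systems in the tables:

<code python>
# Sketch: memory available to each MPI process as the OpenMP thread count
# varies on a fully occupied node. All figures are hypothetical examples.
NODE_MEMORY_GB = 128
CORES_PER_NODE = 24

for threads in (1, 2, 4, 6):
    ranks = CORES_PER_NODE // threads      # fewer ranks as threads increase
    per_rank_gb = NODE_MEMORY_GB / ranks   # each rank can buffer more ERIs
    print(f"{threads} threads/rank -> {ranks:2d} ranks/node, "
          f"{per_rank_gb:5.1f} GB per rank for integral storage")
</code>

In this example, a rank gets about 5 GB with 1 thread but 32 GB with 6 threads, so far more partial integrals can be kept in memory instead of being recomputed on the fly.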
=== Availability ===

The benchmark is available from [[src>benchmarks/QS_LiH_HFX/]].

=== Results ===

The best configurations are shown below. Click the links to see more detail.

^ Machine Name ^ Architecture ^ Date ^ Git Commit ^ Fastest time (s) ^ Configuration ^^ Detailed results ^
| HECToR | Cray XE6 | 21/01/2014 | [[commit>82b8204]] (*) | 121.362 | 65536 cores | 8 OMP threads per MPI task | [[performance:hector-lih-hfx]] |
| ARCHER | Cray XC30 | 09/01/2014 | [[commit>292a983]] (*) | 51.172 | 49152 cores | 6 OMP threads per MPI task | [[performance:archer-lih-hfx]] |
| Magnus | Cray XC40 | 10/11/2014 | [[commit>27eacee]] (*) | 62.075 | 24576 cores | 4 OMP threads per MPI task | [[performance:magnus-lih-hfx]] |
| Piz Daint | Cray XC30 | 12/05/2015 | [[commit>f439118]] | 66.051 | 32768 cores | 4 OMP threads per MPI task, no GPU | [[performance:piz-daint-lih-hfx]] |
| Cirrus | SGI ICE XA | 24/11/2016 | [[commit>989a92c]] | 483.676 | 2016 cores | 6 OMP threads per MPI task | [[performance:cirrus-lih-hfx]] |

(*) Prior to r14945, a bug resulted in an underestimation of the number of ERIs which should be computed (by roughly 50% for this benchmark). These results therefore cannot be compared directly with later ones.

==== H2O-DFT-LS ====

=== Description ===

This is a single-point energy calculation using linear-scaling DFT. It consists of 6144 atoms in a 39 Å<sup>3</sup> box (2048 water molecules in total). An LDA functional is used with a DZVP MOLOPT basis set and a 300 Ry cut-off. For large systems the linear-scaling approach to solving the Self-Consistent-Field equations is much cheaper computationally than standard DFT and allows scaling up to a million atoms for simple systems. The linear-scaling cost arises because the algorithm iterates on the density matrix: the cubically-scaling orthogonalisation step of standard [[Quickstep]] DFT using OT is avoided, and the key operation is sparse matrix-matrix multiplication, where the number of non-zero entries scales linearly with system size. These multiplications are implemented efficiently in the DBCSR library.

=== Availability ===

The benchmark input file used to generate these results is {{performance:h2o-dft-ls-4.inp.gz|available here}}. It is a slightly modified version of the more general input in the CP2K GitHub repository at [[src>benchmarks/QS_DM_LS/H2O-dft-ls.inp]], where the problem size can be tuned via the parameter NREP.

=== Results ===

The best configurations are shown below. Click the links to see more detail.

^ Machine Name ^ Architecture ^ Date ^ Git Commit ^ Fastest time (s) ^ Configuration ^^ Detailed results ^
| HECToR | Cray XE6 | 16/01/2014 | [[commit>82b8204]] | 98.256 | 65536 cores | 8 OMP threads per MPI task | [[performance:hector-h2o-dft-ls]] |
| ARCHER | Cray XC30 | 08/01/2014 | [[commit>292a983]] | 28.476 | 49152 cores | 4 OMP threads per MPI task | [[performance:archer-h2o-dft-ls]] |
| Magnus | Cray XC40 | 03/12/2014 | [[commit>27eacee]] | 30.921 | 24576 cores | 2 OMP threads per MPI task | [[performance:magnus-h2o-dft-ls]] |
| Piz Daint | Cray XC30 | 12/05/2015 | [[commit>f439118]] | 27.900 | 32768 cores | 2 OMP threads per MPI task, no GPU | [[performance:piz-daint-h2o-dft-ls]] |
| Cirrus | SGI ICE XA | 24/11/2016 | [[commit>989a92c]] | 543.032 | 2016 cores | 2 OMP threads per MPI task | [[performance:cirrus-h2o-dft-ls]] |
| Noctua | Cray CS500 | 25/09/2019 | [[commit>9f58d81]] | 37.730 | 10240 cores | 10 OMP threads per MPI task | [[performance:noctua-h2o-dft-ls]] |

==== H2O-64-RI-MP2 ====

=== Description ===

This benchmark is a single-point energy calculation using second-order Møller-Plesset perturbation theory (MP2) with the Resolution-of-the-Identity (RI) approximation to calculate the exchange-correlation energy. The system consists of 64 water molecules in a 12.4 Å<sup>3</sup> cell - exactly the same system as in H2O-64, but treated with a much more accurate model, which is around 100 times more computationally demanding than a standard DFT calculation.

=== Availability ===

The benchmark is in the CP2K GitHub repository at [[src>benchmarks/QS_mp2_rpa/64-H2O/]].

=== Results ===

The best configurations are shown below. Click the links to see more detail.

^ Machine Name ^ Architecture ^ Date ^ Git Commit ^ Fastest time (s) ^ Configuration ^^ Detailed results ^
| HECToR | Cray XE6 | 13/01/2014 | [[commit>82b8204]] | 141.633 | 49152 cores | 8 OMP threads per MPI task | [[performance:hector-h2o-64-ri-mp2]] |
| ARCHER | Cray XC30 | 09/01/2014 | [[commit>292a983]] | 83.945 | 36864 cores | 4 OMP threads per MPI task | [[performance:archer-h2o-64-ri-mp2]] |
| Magnus | Cray XC40 | 04/11/2014 | [[commit>27eacee]] | 63.891 | 24576 cores | 6 OMP threads per MPI task | [[performance:magnus-h2o-64-ri-mp2]] |
| Piz Daint | Cray XC30 | 12/05/2015 | [[commit>f439118]] | 48.15 | 32768 cores | 8 OMP threads per MPI task, no GPU | [[performance:piz-daint-h2o-64-ri-mp2]] |
| Cirrus | SGI ICE XA | 24/11/2016 | [[commit>989a92c]] | 303.571 | 2016 cores | 1 OMP thread per MPI task | [[performance:cirrus-h2o-64-ri-mp2]] |
| Noctua | Cray CS500 | 25/09/2019 | [[commit>9f58d81]] | 82.571 | 10240 cores | 2 OMP threads per MPI task | [[performance:noctua-h2o-64-ri-mp2]] |