How to Compile CP2K with CUDA Support
Currently, three major operations in CP2K support CUDA acceleration:
- Anything that uses dbcsr_multiply, i.e. sparse matrix multiplication, when compiled with -D__ACC -D__DBCSR_ACC. This benefits in particular the linear scaling DFT code. See also the DBCSR project.
- FFTs, when compiled with -D__PW_CUDA.
- scalapack/blas calls (in particular pdgemm/pdsyrk/dgemm), if CP2K is linked against an accelerated library that executes these calls on the GPU. The impact of this is most visible for MP2 and RPA calculations. On the hybrid Cray XC30, linking against libsci_acc makes this happen.
To enable all CUDA acceleration options the following lines have to be added to the ARCH-file:
  NVCC    = /path_to_cuda/bin/nvcc
  DFLAGS += -D__ACC -D__DBCSR_ACC -D__PW_CUDA
  LIBS   += -lcudart -lcublas -lcufft -lrt
As a prerequisite, the NVIDIA CUDA Toolkit has to be installed.
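A quick way to confirm the prerequisite before editing the ARCH-file is to check that nvcc can be found. The following is a minimal sketch, not part of CP2K: the helper names find_nvcc and nvcc_version are hypothetical, and it only assumes that the toolkit provides an nvcc executable that accepts --version.

```python
import shutil
import subprocess


def find_nvcc(cuda_bin_dir=None):
    """Return the full path to nvcc, or None if it cannot be found.

    cuda_bin_dir is an optional directory to search (for example the
    bin directory of the CUDA installation); by default PATH is searched.
    """
    return shutil.which("nvcc", path=cuda_bin_dir)


def nvcc_version(nvcc_path):
    """Return the text printed by `nvcc --version`."""
    result = subprocess.run(
        [nvcc_path, "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    nvcc = find_nvcc()
    if nvcc is None:
        print("nvcc not found; install the CUDA Toolkit or adjust PATH")
    else:
        # This path is what the NVCC variable in the ARCH-file should point to.
        print("NVCC =", nvcc)
        print(nvcc_version(nvcc))
```

If nvcc is installed outside PATH, passing its directory as cuda_bin_dir restricts the search to that location.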
Libcusmm
The acceleration of DBCSR is performed by libcusmm. This library provides a number of kernels, each of which can multiply blocks of specific blocksizes. The blocksizes of a simulation are determined by the employed basis set. By default, libcusmm is compiled with about 200 common kernels. However, if an exotic basis set is used, the kernels for its particular blocksizes might be missing. This can be seen from the DBCSR Statistics, which are printed at the end of every CP2K run.
In the following example the kernel for 13x13x15 was missing:
  -------------------------------------------------------------------------------
  -                                                                             -
  -                                DBCSR STATISTICS                             -
  -                                                                             -
  -------------------------------------------------------------------------------
  COUNTER                                      CPU                  ACC      ACC%
  number of processed stacks                   160                   64      28.6
  matmuls inhomo. stacks                     11880                    0       0.0
  matmuls total                             132360                53530      28.8
  flops  13 x 13 x 13                            0             33218640     100.0
  flops  13 x 13 x 15                     34810620                    0       0.0  <-- kernel missing
  flops  24 x 13 x 13                            0             55177824     100.0
  ...
  flops total                           1452705420            657928368      31.2
  marketing flops                       2048000000
  -------------------------------------------------------------------------------
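Scanning the statistics by eye gets tedious for long outputs. The sketch below is a hypothetical helper, not part of CP2K: it assumes the per-blocksize lines have the form "flops M x N x K CPU ACC ACC%" as in the example above, and reports every blocksize whose flops were all executed on the CPU.

```python
import re

# Matches per-blocksize lines such as "flops 13 x 13 x 15 34810620 0 0.0".
# "flops total" and "marketing flops" do not match, since digits must follow "flops".
FLOPS_LINE = re.compile(
    r"^\s*flops\s+(\d+)\s*x\s*(\d+)\s*x\s*(\d+)\s+(\d+)\s+(\d+)"
)


def missing_kernels(output):
    """Return the (m, n, k) blocksizes that ran only on the CPU."""
    missing = []
    for line in output.splitlines():
        match = FLOPS_LINE.match(line)
        if match is None:
            continue
        m, n, k, cpu_flops, acc_flops = (int(g) for g in match.groups())
        if cpu_flops > 0 and acc_flops == 0:
            missing.append((m, n, k))
    return missing


# Lines taken from the example statistics above.
sample = """
 matmuls total 132360 53530 28.8
 flops 13 x 13 x 13 0 33218640 100.0
 flops 13 x 13 x 15 34810620 0 0.0
 flops 24 x 13 x 13 0 55177824 100.0
 flops total 1452705420 657928368 31.2
 marketing flops 2048000000
"""

for m, n, k in missing_kernels(sample):
    print(f"kernel missing for blocksize {m} x {n} x {k}")
```

For a real run, the same function can be applied to the contents of the CP2K output file instead of the sample string.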
There are over 2300 pre-optimized kernel parameters available in src/dbcsr/libsmm_acc/libcusmm/.
If the desired kernel is already listed in one of the parameters files, it can be included in libcusmm by editing the file generate.py. Otherwise, new kernel parameters have to be optimized, which this howto explains in detail.
Profiling
If you are interested in profiling CP2K with nvprof, have a look at these remarks.