How to Compile CP2K with CUDA Support

Currently, three major operations in CP2K support CUDA acceleration:

Anything that uses dbcsr_multiply, i.e. sparse matrix multiplication, when compiled with -D__ACC -D__DBCSR_ACC. This benefits in particular the linear scaling DFT code. See also the DBCSR project.
FFTs, when compiled with -D__PW_CUDA.
Dense linear algebra, when linked against an accelerated ScaLAPACK/BLAS library that executes the relevant calls (in particular pdgemm/pdsyrk/dgemm) on the GPU. The impact is most visible for MP2 and RPA calculations. On the hybrid Cray XC50, linking against cray-libsci_acc provides this.

To enable all CUDA acceleration options, add the following lines to the ARCH file:

NVCC    = /path_to_cuda/bin/nvcc
DFLAGS += -D__ACC -D__DBCSR_ACC -D__PW_CUDA
LIBS   += -lcudart -lcublas -lcufft -lrt

See here for details. As a prerequisite, the NVIDIA CUDA Toolkit has to be installed.
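
As a quick sanity check before building, the settings listed above can be verified with a small script. The following is only a sketch: the helper name missing_cuda_settings is hypothetical and not part of CP2K, and the check is deliberately naive (it looks for the tokens anywhere in the file and does not verify which make variable they are assigned to).

```python
# Defines and libraries taken from the ARCH-file lines on this page.
REQUIRED_DEFINES = {"-D__ACC", "-D__DBCSR_ACC", "-D__PW_CUDA"}
REQUIRED_LIBS = {"-lcudart", "-lcublas", "-lcufft", "-lrt"}


def missing_cuda_settings(arch_text):
    """Return (missing defines, missing libraries) for the given ARCH-file text.

    Hypothetical helper: a plain token search, not a make parser.
    """
    tokens = set()
    for line in arch_text.splitlines():
        line = line.split("#", 1)[0]  # ignore make comments
        tokens.update(line.split())
    return sorted(REQUIRED_DEFINES - tokens), sorted(REQUIRED_LIBS - tokens)


# Example ARCH fragment in which -D__PW_CUDA was forgotten.
example = """
NVCC    = /path_to_cuda/bin/nvcc
DFLAGS += -D__ACC -D__DBCSR_ACC
LIBS   += -lcudart -lcublas -lcufft -lrt
"""

missing_defines, missing_libs = missing_cuda_settings(example)
print("missing defines:", missing_defines)
print("missing libraries:", missing_libs)
```

In practice the text would be read from the ARCH file in use; an empty result for both lists means all settings from this page are present.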

Libcusmm

The acceleration of DBCSR is performed by libcusmm. This library provides a number of kernels, each of which multiplies blocks of one specific combination of block sizes. The block sizes of a simulation are determined by the employed basis set. As of DBCSR 1.1, libcusmm can by default generate any kernel for {m,n,k}≤80; see here for more details. The DBCSR statistics are printed at the end of every CP2K run, for example:

 -------------------------------------------------------------------------------
 -                                                                             -
 -                                DBCSR STATISTICS                             -
 -                                                                             -
 -------------------------------------------------------------------------------
 COUNTER                                      CPU                  ACC      ACC%
 number of processed stacks                   160                   64      28.6
 matmuls inhomo. stacks                     11880                    0       0.0
 matmuls total                             132360                53530      28.8
 flops  13 x   13 x   13                        0             33218640     100.0
 flops  24 x   13 x   13                        0             55177824     100.0
...
 flops total                           1452705420            657928368      31.2
 marketing flops                       2048000000
 -------------------------------------------------------------------------------
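
The numbers in this example are consistent with the ACC% column being the accelerator share of the combined work, i.e. ACC / (CPU + ACC). The sketch below reproduces three rows of the table under that assumption; the function name acc_percent is hypothetical, and the formula is inferred from the printed values rather than taken from the DBCSR source.

```python
def acc_percent(cpu, acc):
    """Share of the work done on the accelerator, in percent.

    Assumed formula (inferred from the example output): ACC / (CPU + ACC).
    """
    total = cpu + acc
    return 100.0 * acc / total if total else 0.0


# (CPU, ACC) values copied from the statistics block above.
rows = {
    "number of processed stacks": (160, 64),
    "matmuls total": (132360, 53530),
    "flops total": (1452705420, 657928368),
}

for name, (cpu, acc) in rows.items():
    print(f"{name:<30s}{acc_percent(cpu, acc):6.1f}")
```

A low value in the "flops total" row, as here, indicates that a large part of the multiplications was still executed on the CPU.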

Support for more GPUs can be added; please refer to this howto.

New kernel parameters have to be optimized, which this howto explains in detail.

Profiling

If you are interested in profiling CP2K with nvprof, have a look at these remarks.