====== Compiling CP2K with CUDA ======

The following parts of CP2K can take advantage of CUDA acceleration:
  * Anything that uses ''dbcsr_multiply'', i.e. sparse matrix multiplication, when compiled with ''%%-D__ACC -D__DBCSR_ACC%%''. This benefits in particular the [[doi>10.1021/ct200897x|linear scaling DFT]] code. See also [[http://dbcsr.cp2k.org|the DBCSR project]].
  * FFTs, when compiled with ''%%-D__PW_CUDA%%''.
  * ScaLAPACK/BLAS calls (in particular ''pdgemm''/''pdsyrk''/''dgemm''), when linked against an accelerated library that executes these calls on the GPU. The impact of this is most visible for MP2 and RPA calculations. On the hybrid Cray XC50, linking against cray-libsci_acc makes this happen.
  
To enable all CUDA acceleration options, the following lines have to be added to the ARCH file:
<code>
NVCC    = /path_to_cuda/bin/nvcc
DFLAGS += -D__ACC -D__DBCSR_ACC -D__PW_CUDA
LIBS   += -lcudart -lcublas -lcufft -lnvrtc
</code>
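
Depending on the build setup, the target GPU architecture usually has to be specified as well. A minimal sketch, assuming the variable names ''GPUVER'' and ''NVFLAGS'' as used in CP2K's toolchain-generated ARCH files (the card and compute capability below are placeholders, adjust them to your hardware):
<code>
# hypothetical addition for an NVIDIA P100 (compute capability 6.0)
GPUVER   = P100
NVFLAGS  = $(DFLAGS) -O3 -arch sm_60
</code>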
  
See [[https://github.com/cp2k/cp2k/blob/master/INSTALL.md#2j-cuda-optional-improved-performance-on-gpu-systems|here]] for details.
As a prerequisite, the [[https://developer.nvidia.com/cuda-toolkit|Nvidia CUDA Toolkit]] has to be installed.
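
With the ARCH file in place, CP2K is built as usual. A minimal sketch, assuming the CUDA-enabled ARCH file was saved as ''arch/local_cuda.psmp'' (the ARCH name is a placeholder):
<code>
# hypothetical build of the MPI+OpenMP (psmp) version from the CP2K source tree
cd cp2k
make -j 8 ARCH=local_cuda VERSION=psmp
</code>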
  
  
===== Libcusmm =====
The acceleration of DBCSR is performed by libcusmm. This library provides a number of kernels, each of which multiplies blocks of specific blocksizes. The blocksizes occurring in a simulation are determined by the employed basis set. As of DBCSR 1.1, libcusmm is by default able to generate any kernel with {m,n,k} <= 80; see [[https://github.com/cp2k/dbcsr/blob/develop/src/acc/libsmm_acc/libcusmm/README.md|here]] for more details. The //DBCSR Statistics// are printed at the end of every CP2K run, for example:
  
<code>
 -------------------------------------------------------------------------------
 matmuls total                             132360                53530      28.8
 flops  13 x   13 x   13                        0             33218640     100.0
 flops  24 x   13 x   13                        0             55177824     100.0
 ...
</code>
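
As a rough illustration (the output file name is a placeholder and the header string may differ between CP2K versions), this block can be located in a finished run's output with:
<code>
# hypothetical: show the DBCSR statistics block of a completed run
grep -A 20 "DBCSR STATISTICS" cp2k.out
</code>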
  
Support for more GPUs can be added; please refer to [[https://github.com/cp2k/dbcsr/blob/develop/src/acc/libsmm_acc/libcusmm/tune.md|this howto]].
  
===== Profiling =====
If you are interested in profiling CP2K with nvprof, have a look at [[dev:profiling#nvprof|these remarks]].
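
A minimal sketch of such a profiling run, assuming a CUDA-enabled ''cp2k.psmp'' binary and an input file ''H2O-64.inp'' (both names are placeholders):
<code>
# hypothetical example: profile a single-rank CP2K run with nvprof
mpirun -np 1 nvprof --log-file nvprof.log ./cp2k.psmp H2O-64.inp
</code>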