====== Compiling CP2K with CUDA ======

The following parts of CP2K can take advantage of CUDA acceleration:
  * Anything that uses ''dbcsr_multiply'', i.e. sparse matrix multiplication, when compiled with ''%%-D__ACC -D__DBCSR_ACC%%''. This benefits in particular the [[doi>10.1021/ct200897x|linear scaling DFT]] code. See also [[http://dbcsr.cp2k.org|the DBCSR project]].
  * FFTs, when compiled with ''%%-D__PW_CUDA%%''.
  * ScaLAPACK/BLAS calls (in particular ''pdgemm''/''pdsyrk''/''dgemm''), when linked against an accelerated library that executes these calls on the GPU. The impact of this is most visible for MP2 and RPA calculations. On the hybrid Cray XC50, linking against cray-libsci_acc makes this happen.
  
To enable all CUDA acceleration options, the following lines have to be added to the ARCH file:
<code>
NVCC    = /path_to_cuda/bin/nvcc
DFLAGS += -D__ACC -D__DBCSR_ACC -D__PW_CUDA
LIBS   += -lcudart -lcublas -lcufft -lnvrtc
</code>
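
Depending on the build setup, the target GPU architecture usually has to be specified as well. A minimal sketch, assuming the variable names ''GPUVER'' and ''NVFLAGS'' as used in CP2K's toolchain-generated ARCH files (the card and compute capability below are placeholders, adjust them to your hardware):
<code>
# hypothetical addition for an NVIDIA P100 (compute capability 6.0)
GPUVER   = P100
NVFLAGS  = $(DFLAGS) -O3 -arch sm_60
</code>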
  
See [[https://github.com/cp2k/cp2k/blob/master/INSTALL.md#2j-cuda-optional-improved-performance-on-gpu-systems|here]] for details.
As a prerequisite, the [[https://developer.nvidia.com/cuda-toolkit|Nvidia CUDA Toolkit]] has to be installed.
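
With the ARCH file in place, CP2K is built as usual. A minimal sketch, assuming the CUDA-enabled ARCH file was saved as ''arch/local_cuda.psmp'' (the ARCH name is a placeholder):
<code>
# hypothetical build of the MPI+OpenMP (psmp) version from the CP2K source tree
cd cp2k
make -j 8 ARCH=local_cuda VERSION=psmp
</code>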
  
  
===== Libcusmm =====
The acceleration of DBCSR is performed by libcusmm. This library provides a number of kernels, each of which multiplies blocks of specific blocksizes. The blocksizes occurring in a simulation are determined by the employed basis set. As of DBCSR 1.1, libcusmm is by default able to generate any kernel with {m,n,k} <= 80; see [[https://github.com/cp2k/dbcsr/blob/develop/src/acc/libsmm_acc/libcusmm/README.md|here]] for more details. The //DBCSR Statistics// are printed at the end of every CP2K run, for example:
  
<code>
 -------------------------------------------------------------------------------
 matmuls total                             132360                53530      28.8
 flops  13 x   13 x   13                        0             33218640     100.0
 flops  24 x   13 x   13                        0             55177824     100.0
 ...
</code>
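
As a rough illustration (the output file name is a placeholder and the header string may differ between CP2K versions), this block can be located in a finished run's output with:
<code>
# hypothetical: show the DBCSR statistics block of a completed run
grep -A 20 "DBCSR STATISTICS" cp2k.out
</code>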
  
Support for more GPUs can be added; please refer to [[https://github.com/cp2k/dbcsr/blob/develop/src/acc/libsmm_acc/libcusmm/tune.md|this howto]].
  
===== Profiling =====
If you are interested in profiling CP2K with nvprof, have a look at [[dev:profiling#nvprof|these remarks]].
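
A minimal sketch of such a profiling run, assuming a CUDA-enabled ''cp2k.psmp'' binary and an input file ''H2O-64.inp'' (both names are placeholders):
<code>
# hypothetical example: profile a single-rank CP2K run with nvprof
mpirun -np 1 nvprof --log-file nvprof.log ./cp2k.psmp H2O-64.inp
</code>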