howto:libcusmm — revised 2019/02/06 by sjakobovits to reflect changes brought by PR #137 to the DBCSR repo; previous revision 2014/10/27 by oschuett.
====== Howto Optimize Cuda Kernels for Libcusmm ======
**Python version required:** python3.6

If you are about to autotune parameters for a new GPU (i.e. a GPU for which there are no autotuned parameters yet), please first follow these instructions.

=== Step 1: Go to the libcusmm directory ===
<code>
$ cd dbcsr/src/acc/libsmm_acc/libcusmm
</code>
  
=== Step 2: Adapt tune_setup.py to your environment ===
The ''tune_setup.py'' script generates job files. You have to adapt the script to the environment of your supercomputer and your personal settings.
<code python>
...
</code>
  
=== Step 3: Run the script tune_setup.py ===
The script takes as arguments the blocksizes you want to add to libcusmm. For example, if your system contains blocks of size 5 and 8, type:
<code>
$ ./tune_setup.py 5 8
Found 23 parameter sets for 5x5x5
Found 31 parameter sets for 5x5x8
...
tune_8x8x8.job
</code>
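The expansion from blocksize arguments into kernel triples can be sketched as follows (a minimal illustration, not the actual ''tune_setup.py'' code; the function name is made up):

<code python>
import itertools

def enumerate_triples(blocksizes):
    # Every (m, n, k) combination of the given sizes becomes one kernel
    # to tune, i.e. len(blocksizes)**3 triples in total.
    return ["%dx%dx%d" % t for t in itertools.product(sorted(blocksizes), repeat=3)]

triples = enumerate_triples([5, 8])  # 2**3 = 8 kernels, from 5x5x5 up to 8x8x8
</code>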
For each possible parameter-set a //launcher// is generated. A launcher is a small snippet of C code which launches the kernel using the CUDA-specific ''%%<<< >>>%%'' notation. It also instantiates the C++ template that contains the actual kernel code.
  
In order to parallelize the benchmarking, the launchers are distributed over multiple executables.
Currently, up to 10000 launchers are benchmarked by one //executable//. Each executable is linked together from several ''tune_*_part???.o'' files and a ''tune_*_main.o''. Each part-file contains up to 100 launchers. This allows the compilation to be parallelized over multiple CPU cores.
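The grouping into part files and executables described above can be sketched like this (an illustrative sketch with made-up names; the limits of 100 launchers per part file and 10000 per executable are taken from the text):

<code python>
def chunk(items, size):
    # Split a list into consecutive chunks of at most `size` elements.
    return [items[i:i + size] for i in range(0, len(items), size)]

launchers = ["launcher_%d" % i for i in range(25000)]
executables = chunk(launchers, 10000)                  # up to 10000 launchers per executable
part_files = [chunk(exe, 100) for exe in executables]  # each part file holds up to 100
</code>

Compiling the many small part files independently is what lets the build spread the work over multiple CPU cores.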
  
=== Step 4: Adapt tune_submit.py to your environment ===
The script ''tune_submit.py'' was written for the SLURM batch system as used e.g. by Cray supercomputers. If your computer runs a different batch system, you have to adapt ''tune_submit.py'' accordingly.
  
=== Step 5: Submit Jobs ===
Each tune-directory contains a job file.
Since there might be many tune-directories, the convenience script ''tune_submit.py'' can be used to submit jobs. It will go through all the ''tune_*'' directories and check whether their jobs have already been submitted or run. For this, the script calls ''squeue'' in the background and searches for ''slurm-*.out'' files.
  
When ''tune_submit.py'' is called without arguments, it will just list the jobs that could be submitted:
<code>
$ ./tune_submit.py
          tune_5x5x5: Would submit, run with "doit!"
          tune_5x5x8: Would submit, run with "doit!"
...
</code>
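The check performed for each directory can be sketched as follows (a simplified illustration; the real ''tune_submit.py'' additionally parses the output of ''squeue''):

<code python>
import tempfile
from pathlib import Path

def should_submit(tune_dir, queued_jobs):
    # Skip a directory if its job is already in the batch queue, or if
    # a slurm-*.out file shows that the job has already run.
    name = Path(tune_dir).name
    if name in queued_jobs:
        return False
    if any(Path(tune_dir).glob("slurm-*.out")):
        return False
    return True

# Tiny demonstration on a throw-away directory tree.
root = Path(tempfile.mkdtemp())
done = root / "tune_5x5x5"
done.mkdir()
(done / "slurm-277987.out").touch()
fresh = root / "tune_5x5x8"
fresh.mkdir()
</code>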
  
Only when ''tune_submit.py'' is called with ''doit!'' as its first argument will it actually submit jobs:
<code>
$ ./tune_submit.py doit!
          tune_5x5x5: Submitting
Submitted batch job 277987
...
</code>
  
=== Step 6: Collect Results ===
Run ''tune_collect.py'' to parse all log files and determine the best kernel for each blocksize:
<code>
$ ./tune_collect.py
Reading: tune_5x5x5/tune_5x5x5_exe0.log
Reading: tune_5x5x8/tune_5x5x8_exe0.log