====== Howto Optimize Cuda Kernels for Libcusmm ======
**Python version required:** python3.6

If you are about to autotune parameters for a new GPU (i.e. a GPU for which there are no autotuned parameters yet), please first follow these instructions.

=== Step 1: Go to the libcusmm directory ===
<code>
$ cd dbcsr/src/acc/libsmm_acc/libcusmm
</code>
  
=== Step 2: Adapt tune_setup.py to your environment ===
The ''tune_setup.py'' script generates job files. You have to adapt the script to the environment of your supercomputer and your personal settings.
<code python>
...
</code>
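To give an idea of what "adapting to your environment" means, here is a purely hypothetical sketch: none of the names below come from ''tune_setup.py'' itself. The generated ''.job'' files embed site-specific batch directives (account, time limit, GPU constraint), and those are the parts you would change.

```python
# Hypothetical illustration: tune_setup.py's real template and variable
# names may differ. The site-specific pieces are the #SBATCH directives
# and the build/run commands.
job_template = """\
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --time=00:30:00
#SBATCH --account={account}
#SBATCH --constraint=gpu

cd {tune_dir}
make -j 16
./{executable} > {executable}.log 2>&1
"""

# Fill in the placeholders for one tune directory:
print(job_template.format(account="your_project",
                          tune_dir="tune_5x5x5",
                          executable="tune_5x5x5_exe0"))
```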
  
=== Step 3: Run the script tune_setup.py ===
The script takes as arguments the blocksizes you want to add to libcusmm. For example, if your system contains blocks of size 5 and 8, type:
<code>
$ ./tune_setup.py 5 8
Found 23 parameter sets for 5x5x5
Found 31 parameter sets for 5x5x8
...
tune_8x8x8.job
</code>
For each possible parameter set, a //launcher// is generated. A launcher is a small snippet of C code which launches the kernel using the CUDA-specific ''%%<<< >>>%%'' notation. It also instantiates the C++ template that contains the actual kernel code.
  
In order to parallelize the benchmarking, the launchers are distributed over multiple executables.
Currently, up to 10000 launchers are benchmarked by one //executable//. Each executable is linked together from several ''tune_*_part???.o'' files and a ''tune_*_main.o''. Each part file contains up to 100 launchers. This allows the compilation to be parallelized over multiple CPU cores.
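The grouping described above can be sketched as follows. The numbers are the ones quoted in the text; the function and variable names are illustrative, not ''tune_setup.py'''s actual code.

```python
# Sketch of the launcher grouping: 100 launchers per part file,
# 100 parts (i.e. up to 10000 launchers) per executable.
LAUNCHERS_PER_PART = 100   # launchers compiled into one tune_*_part???.o
PARTS_PER_EXE = 100        # 100 parts x 100 launchers = 10000 per executable

def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# e.g. 250 parameter sets for one blocksize:
launchers = [f"launcher_{i}" for i in range(250)]
parts = chunk(launchers, LAUNCHERS_PER_PART)   # 3 part files: 100, 100, 50
executables = chunk(parts, PARTS_PER_EXE)      # all 3 parts fit in 1 executable
```

Because each part file is an independent compilation unit, the parts can be compiled concurrently (e.g. with ''make -j''), which is the point of the split.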
  
=== Step 4: Adapt tune_submit.py to your environment ===
The script ''tune_submit.py'' was written for the SLURM batch system, as used e.g. by Cray supercomputers. If your computer runs a different batch system, you have to adapt ''tune_submit.py'' accordingly.
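What needs changing depends on the batch system. As an illustrative sketch (these names are hypothetical, not variables from ''tune_submit.py''), the SLURM-specific pieces are roughly the submit command, the queue query, and the output-file pattern:

```python
# Hypothetical sketch of the SLURM-specific knobs one would swap out
# when porting the submission script to another batch system.
SUBMIT_CMD = ["sbatch"]                            # PBS/Torque: ["qsub"]
QUEUE_CMD = ["squeue", "--noheader", "-o", "%j"]   # list queued job names
OUTPUT_GLOB = "slurm-*.out"                        # PBS writes e.g. "*.o<jobid>"
```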
  
=== Step 5: Submit Jobs ===
Each tune-directory contains a job file.
Since there might be many tune-directories, the convenience script ''tune_submit.py'' can be used to submit jobs. It goes through all the ''tune_*''-directories and checks whether each job has already been submitted or run. For this, the script calls ''squeue'' in the background and searches for ''slurm-*.out'' files.
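The per-directory check can be sketched like this (the function and argument names are illustrative, not ''tune_submit.py'''s actual API):

```python
import glob
import os

def job_status(tune_dir, queued_job_names):
    """Decide what the submission script would do for one tune_* directory.

    queued_job_names stands in for the job names reported by squeue.
    """
    if glob.glob(os.path.join(tune_dir, "slurm-*.out")):
        return "already ran"        # a SLURM output file exists
    if os.path.basename(tune_dir) in queued_job_names:
        return "already submitted"  # squeue still lists the job
    return "would submit"
```

Only directories in the "would submit" state get a new job, so re-running the script after a partial failure is safe.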
  
When ''tune_submit.py'' is called without arguments, it will just list the jobs that could be submitted:
<code>
$ ./tune_submit.py
          tune_5x5x5: Would submit, run with "doit!"
          tune_5x5x8: Would submit, run with "doit!"
...
</code>
  
Only when ''tune_submit.py'' is called with ''doit!'' as its first argument will it actually submit jobs:
<code>
$ ./tune_submit.py doit!
          tune_5x5x5: Submitting
Submitted batch job 277987
...
</code>
  
=== Step 6: Collect Results ===
Run ''tune_collect.py'' to parse all log files and determine the best kernel for each blocksize:
<code>
$ ./tune_collect.py
Reading: tune_5x5x5/tune_5x5x5_exe0.log
Reading: tune_5x5x8/tune_5x5x8_exe0.log