Howto Optimize CUDA Kernels for Libcusmm

Step 1: Go to the libcusmm directory

$ cd $CP2K_ROOT/src/dbcsr/cuda/libcusmm

Step 2: Run the script tune.py

The script takes as arguments the blocksizes you want to add to libcusmm. For example, if your system contains blocks of size 5 and 8, type:

$ ./tune.py 5 8
Found 23 parameter sets for 5x5x5
Found 31 parameter sets for 5x5x8
Found 107 parameter sets for 5x8x5
Found 171 parameter sets for 5x8x8
Found 75 parameter sets for 8x5x5
Found 107 parameter sets for 8x5x8
Found 248 parameter sets for 8x8x5
Found 424 parameter sets for 8x8x8

The script will create a directory for each (m, n, k) combination of the blocksizes, i.e. 2^3 = 8 directories for the two blocksizes above:

$ ls -d tune_*
tune_5x5x5  tune_5x5x8  tune_5x8x5  tune_5x8x8  tune_8x5x5  tune_8x5x8  tune_8x8x5  tune_8x8x8

Each directory contains a number of files:

$ ls -1 tune_8x8x8/
Makefile
tune_8x8x8_exe0_main.cu
tune_8x8x8_exe0_part0.cu
tune_8x8x8_exe0_part1.cu
tune_8x8x8_exe0_part2.cu
tune_8x8x8_exe0_part3.cu
tune_8x8x8_exe0_part4.cu
tune_8x8x8.job

For each possible parameter set, a launcher is generated. A launcher is a small snippet of C code which launches the kernel using the CUDA-specific <<< >>> notation. It also instantiates the C++ template that contains the actual kernel code.
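
To make this concrete, a generated launcher looks roughly like the sketch below. All names, the kernel signature, and the concrete template parameters here are illustrative placeholders, not the exact code emitted by tune.py:

#include <cuda_runtime.h>

// Kernel template holding the actual multiplication code. In the real
// libcusmm the definition lives in a shared header; the empty body here
// is a placeholder.
template <int m, int n, int k, int threads, int grouping, int minblocks>
__global__ void cusmm_kernel(const int *param_stack, int stack_size,
                             const double *a_data, const double *b_data,
                             double *c_data) {
  // ... small-matrix multiply-accumulate code omitted in this sketch ...
}

// Launcher: a plain C entry point that instantiates the template for one
// parameter set and starts the kernel with the <<< >>> notation.
extern "C" int launch_kernel_8x8x8_threads128(const int *param_stack,
                                              int stack_size,
                                              cudaStream_t stream,
                                              const double *a_data,
                                              const double *b_data,
                                              double *c_data) {
  const int grouping = 16;  // stack entries processed per thread block
  const int nblocks = (stack_size + grouping - 1) / grouping;
  cusmm_kernel<8, 8, 8, 128, 16, 1>
      <<<nblocks, 128, 0, stream>>>(param_stack, stack_size,
                                    a_data, b_data, c_data);
  return (int)cudaGetLastError();
}

The benchmarking executables call one such launcher per parameter set; the parameter names (threads, grouping, minblocks) reappear in the collect.py output of step 4.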

In order to parallelize the compilation and the benchmarking, the launchers are distributed over several files. Currently, up to 10000 launchers are compiled into one executable. Each executable is linked together from several parts and a tune_*_main.o. Each part contains up to 100 launchers and is compiled into a separate object file tune_*_part???.o, so an executable consists of up to 100 part object files.

Step 3: Submit Jobs

Each tune-directory contains a job file. Since there might be many tune-directories, the convenience script submit.py can be used. It will go through all the tune_*-directories and check whether each one has already been submitted or run. For this, the script calls squeue in the background and searches for slurm-*.out files.

When submit.py is called without arguments, it will just list the jobs that could be submitted:

$ ./submit.py 
          tune_5x5x5: Would submit, run with "doit!"
          tune_5x5x8: Would submit, run with "doit!"
          tune_5x8x5: Would submit, run with "doit!"
          tune_5x8x8: Would submit, run with "doit!"
          tune_8x5x5: Would submit, run with "doit!"
          tune_8x5x8: Would submit, run with "doit!"
          tune_8x8x5: Would submit, run with "doit!"
          tune_8x8x8: Would submit, run with "doit!"
Number of jobs submitted: 8

Only when submit.py is called with doit! as its first argument will it actually submit the jobs:

$ ./submit.py doit!
          tune_5x5x5: Submitting
Submitted batch job 277987
          tune_5x5x8: Submitting
Submitted batch job 277988
          tune_5x8x5: Submitting
Submitted batch job 277989
          tune_5x8x8: Submitting
Submitted batch job 277990
          tune_8x5x5: Submitting
Submitted batch job 277991
          tune_8x5x8: Submitting
Submitted batch job 277992
          tune_8x8x5: Submitting
Submitted batch job 277993
          tune_8x8x8: Submitting
Submitted batch job 277994
Number of jobs submitted: 8

Step 4: Collect Results

Run collect.py to parse all log files and to determine the best kernel for each blocksize combination:

$ ./collect.py
Reading: tune_5x5x5/tune_5x5x5_exe0.log
Reading: tune_5x5x8/tune_5x5x8_exe0.log
Reading: tune_5x8x5/tune_5x8x5_exe0.log
Reading: tune_5x8x8/tune_5x8x8_exe0.log
Reading: tune_8x5x5/tune_8x5x5_exe0.log
Reading: tune_8x5x8/tune_8x5x8_exe0.log
Reading: tune_8x8x5/tune_8x8x5_exe0.log
Reading: tune_8x8x8/tune_8x8x8_exe0.log
Kernel_dnt_tiny(m=5, n=5, k=5, split_thread=32, threads=64, grouping=16, minblocks=1) , # 27.9623 GFlops 
Kernel_dnt_tiny(m=5, n=5, k=8, split_thread=32, threads=96, grouping=16, minblocks=1) , # 37.8978 GFlops
Kernel_dnt_medium(m=5, n=8, k=5, tile_m=1, tile_n=1, threads=96, grouping=16, minblocks=8) , # 32.9231 GFlops 
Kernel_dnt_tiny(m=5, n=8, k=8, split_thread=32, threads=96, grouping=16, minblocks=1) , # 47.0366 GFlops
Kernel_dnt_medium(m=8, n=5, k=5, tile_m=1, tile_n=1, threads=96, grouping=16, minblocks=12) , # 33.1999 GFlops 
Kernel_dnt_medium(m=8, n=5, k=8, tile_m=1, tile_n=1, threads=96, grouping=16, minblocks=12) , # 49.3499 GFlops
Kernel_dnt_tiny(m=8, n=8, k=5, split_thread=32, threads=96, grouping=16, minblocks=1) , # 62.8469 GFlops 
Kernel_dnt_tiny(m=8, n=8, k=8, split_thread=32, threads=128, grouping=16, minblocks=1) , # 90.7763 GFlops 