====== Howto Optimize CUDA Kernels for Libcusmm ======
=== Step 1: Go to the libcusmm directory ===
<code>
$ cd dbcsr/src/acc/libsmm_acc/libcusmm
</code>
  
=== Step 2: Adapt tune.py for your Environment ===
The ''tune.py'' script generates job files. You have to adapt the script to the environment of your supercomputer and your personal settings, e.g. the batch-system account, the module setup, and the launch command (''aprun'' in the excerpt below).
<code python>
...
def gen_jobfile(outdir, m, n, k):
    t = "/tune_%dx%dx%d" % (m, n, k)
    all_exe_src = [basename(fn) for fn in glob(outdir + t + "_*_main.cu")]
    all_exe = sorted([fn.replace("_main.cu", "") for fn in all_exe_src])

    # one node per executable; each executable is built in parallel
    output = "#!/bin/bash -l\n"
    output += "#SBATCH --nodes=%d\n" % len(all_exe)
    output += "#SBATCH --time=0:30:00\n"
    output += "#SBATCH --account=s441\n"
    output += "\n"
    output += "source ${MODULESHOME}/init/sh;\n"
    output += "module unload PrgEnv-cray\n"
    output += "module load cudatoolkit PrgEnv-gnu\n"
    output += "module list\n"
    output += "cd $SLURM_SUBMIT_DIR \n"
    output += "\n"
    output += "date\n"
    for exe in all_exe:
        output += "aprun -b -n 1 -N 1 -d 8 make -j 16 %s &\n" % exe
    ...
</code>

=== Step 3: Run the script tune.py ===
The script takes as arguments the block sizes you want to add to libcusmm. For example, if your system contains blocks of size 5 and 8, type:
<code>
$ ./tune.py 5 8
Found 23 parameter sets for 5x5x5
Found 31 parameter sets for 5x5x8
Found 107 parameter sets for 5x8x5
Found 171 parameter sets for 5x8x8
Found 75 parameter sets for 8x5x5
Found 107 parameter sets for 8x5x8
Found 248 parameter sets for 8x8x5
Found 424 parameter sets for 8x8x8
</code>

The script creates a directory for each combination of the block sizes:
<code>
$ ls -d tune_*
tune_5x5x5  tune_5x5x8  tune_5x8x5  tune_5x8x8  tune_8x5x5  tune_8x5x8  tune_8x8x5  tune_8x8x8
</code>

Each directory contains a number of files:
<code>
$ ls -1 tune_8x8x8/
Makefile
tune_8x8x8_exe0_main.cu
tune_8x8x8_exe0_part0.cu
tune_8x8x8_exe0_part1.cu
tune_8x8x8_exe0_part2.cu
tune_8x8x8_exe0_part3.cu
tune_8x8x8_exe0_part4.cu
tune_8x8x8.job
</code>
For each possible parameter set a //launcher// is generated. A launcher is a small snippet of C code, which launches the kernel by using the CUDA-specific ''%%<<< >>>%%'' notation. It also instantiates the C++ template which contains the actual kernel code.
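
As a rough illustration, generating such a launcher amounts to filling a parameter set into a C code template. This is a minimal sketch only: the kernel name, template parameters, and launch configuration below are hypothetical, not the actual ''tune.py'' code.
<code python>
# Hypothetical sketch of launcher generation; names and signatures
# are illustrative only.
def gen_launcher(m, n, k, threads, grouping, minblocks):
    """Return the C source of one launcher for a given parameter set."""
    args = (m, n, k, threads, grouping, minblocks)
    return """
int launch_tiny_%dx%dx%d_%d_%d_%d(int *param_stack, int stack_size,
                                  cudaStream_t stream,
                                  double *a, double *b, double *c) {
    // instantiate the C++ kernel template for this parameter set
    // and launch it with the CUDA-specific <<< >>> notation
    cusmm_dnt_tiny<%d, %d, %d, %d, %d, %d>
        <<< stack_size / %d, %d, 0, stream >>>
        (param_stack, stack_size, a, b, c);
    return 0;
}
""" % (args + args + (grouping, threads))
</code>
Calling ''print(gen_launcher(8, 8, 8, 128, 16, 1))'' would then print one such snippet; ''tune.py'' writes many of these into the part files described below.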

In order to parallelize the benchmarking, the launchers are distributed over multiple executables.
Currently, up to 10000 launchers are benchmarked by one //executable//. Each executable is linked together from several ''tune_*_part???.o'' files and a ''tune_*_main.o''. Each part file contains up to 100 launchers. This allows the compilation to be parallelized over multiple CPU cores.

=== Step 4: Adapt submit.py for your Environment ===
The script ''submit.py'' was written for the SLURM batch system as used, e.g., by Cray supercomputers. If your computer runs a different batch system, you have to adapt ''submit.py'' accordingly.
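
Typically only the batch-system-specific commands need to change. A minimal sketch, assuming PBS as the alternative system; the function below is illustrative, not the actual ''submit.py'' code:
<code python>
# Illustrative only: isolate the batch-system-specific command so that
# porting to another scheduler means swapping a single call.
from subprocess import check_call

def submit_job(jobfile, batch_system="slurm"):
    """Submit a job file with the site's batch system."""
    if batch_system == "slurm":
        check_call(["sbatch", jobfile])
    elif batch_system == "pbs":
        check_call(["qsub", jobfile])
    else:
        raise ValueError("unknown batch system: " + batch_system)
</code>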

=== Step 5: Submit Jobs ===
Each tune directory contains a job file.
Since there might be many tune directories, the convenience script ''submit.py'' can be used. It goes through all ''tune_*'' directories and checks whether the job has already been submitted or run. For this, the script calls ''squeue'' in the background and searches for ''slurm-*.out'' files.
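
The skip logic boils down to two checks, roughly as in this sketch (hypothetical code, not the actual ''submit.py''):
<code python>
# Illustrative sketch of the skip check described above.
from glob import glob
from subprocess import check_output

def already_done(tune_dir, queued_names):
    """True if this tune directory already ran or is still queued."""
    if glob(tune_dir + "/slurm-*.out"):
        return True                    # output file exists: job already ran
    return tune_dir in queued_names    # job name still in the queue

queued_names = check_output(["squeue", "--format=%j"]).decode()
for d in sorted(glob("tune_*")):
    if not already_done(d, queued_names):
        print('%20s: Would submit, run with "doit!"' % d)
</code>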

When ''submit.py'' is called without arguments, it just lists the jobs that could be submitted:
<code>
$ ./submit.py
          tune_5x5x5: Would submit, run with "doit!"
          tune_5x5x8: Would submit, run with "doit!"
          tune_5x8x5: Would submit, run with "doit!"
          tune_5x8x8: Would submit, run with "doit!"
          tune_8x5x5: Would submit, run with "doit!"
          tune_8x5x8: Would submit, run with "doit!"
          tune_8x8x5: Would submit, run with "doit!"
          tune_8x8x8: Would submit, run with "doit!"
Number of jobs submitted: 8
</code>

Only when ''submit.py'' is called with ''doit!'' as its first argument will it actually submit the jobs:
<code>
$ ./submit.py doit!
          tune_5x5x5: Submitting
Submitted batch job 277987
          tune_5x5x8: Submitting
Submitted batch job 277988
          tune_5x8x5: Submitting
Submitted batch job 277989
          tune_5x8x8: Submitting
Submitted batch job 277990
          tune_8x5x5: Submitting
Submitted batch job 277991
          tune_8x5x8: Submitting
Submitted batch job 277992
          tune_8x8x5: Submitting
Submitted batch job 277993
          tune_8x8x8: Submitting
Submitted batch job 277994
Number of jobs submitted: 8
</code>

=== Step 6: Collect Results ===
Run ''collect.py'' to parse all log files and to determine the best kernel for each block size:
<code>
$ ./collect.py
Reading: tune_5x5x5/tune_5x5x5_exe0.log
Reading: tune_5x5x8/tune_5x5x8_exe0.log
Reading: tune_5x8x5/tune_5x8x5_exe0.log
Reading: tune_5x8x8/tune_5x8x8_exe0.log
Reading: tune_8x5x5/tune_8x5x5_exe0.log
Reading: tune_8x5x8/tune_8x5x8_exe0.log
Reading: tune_8x8x5/tune_8x8x5_exe0.log
Reading: tune_8x8x8/tune_8x8x8_exe0.log
Kernel_dnt_tiny(m=5, n=5, k=5, split_thread=32, threads=64, grouping=16, minblocks=1) , # 27.9623 GFlops
Kernel_dnt_tiny(m=5, n=5, k=8, split_thread=32, threads=96, grouping=16, minblocks=1) , # 37.8978 GFlops
Kernel_dnt_medium(m=5, n=8, k=5, tile_m=1, tile_n=1, threads=96, grouping=16, minblocks=8) , # 32.9231 GFlops
Kernel_dnt_tiny(m=5, n=8, k=8, split_thread=32, threads=96, grouping=16, minblocks=1) , # 47.0366 GFlops
Kernel_dnt_medium(m=8, n=5, k=5, tile_m=1, tile_n=1, threads=96, grouping=16, minblocks=12) , # 33.1999 GFlops
Kernel_dnt_medium(m=8, n=5, k=8, tile_m=1, tile_n=1, threads=96, grouping=16, minblocks=12) , # 49.3499 GFlops
Kernel_dnt_tiny(m=8, n=8, k=5, split_thread=32, threads=96, grouping=16, minblocks=1) , # 62.8469 GFlops
Kernel_dnt_tiny(m=8, n=8, k=8, split_thread=32, threads=128, grouping=16, minblocks=1) , # 90.7763 GFlops
</code>
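
Conceptually, the collection step is a maximum over the measured timings. A minimal sketch, assuming log lines shaped like the output above (illustrative, not the actual ''collect.py''):
<code python>
# Illustrative sketch of the collection step: keep, per (m, n, k), the
# kernel line with the highest measured GFlops.
import re
from glob import glob

best = {}  # (m, n, k) -> (gflops, kernel line)
for logfile in sorted(glob("tune_*/tune_*_exe*.log")):
    print("Reading: " + logfile)
    with open(logfile) as f:
        for line in f:
            match = re.search(r"m=(\d+), n=(\d+), k=(\d+).*# *([0-9.]+) GFlops", line)
            if match:
                m, n, k = (int(x) for x in match.groups()[:3])
                gflops = float(match.group(4))
                if gflops > best.get((m, n, k), (0.0, ""))[0]:
                    best[(m, n, k)] = (gflops, line.strip())

for key in sorted(best):
    print(best[key][1])
</code>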