====== Howto Optimize CUDA Kernels for Libcusmm ======
=== Step 1: Go to the libcusmm directory ===
<code>
$ cd dbcsr/src/acc/libsmm_acc/libcusmm
</code>
  
=== Step 2: Adapt tune.py for your Environment ===
The ''tune.py'' script generates job files. You have to adapt the script to the environment of your supercomputer and your personal settings, e.g. the batch-system account, the module setup, and the launch command (''aprun'' in the excerpt below).
<code python>
...
def gen_jobfile(outdir, m, n, k):
    t = "/tune_%dx%dx%d" % (m, n, k)
    all_exe_src = [basename(fn) for fn in glob(outdir + t + "_*_main.cu")]
    all_exe = sorted([fn.replace("_main.cu", "") for fn in all_exe_src])

    # one node per executable; each executable is built in parallel
    output = "#!/bin/bash -l\n"
    output += "#SBATCH --nodes=%d\n" % len(all_exe)
    output += "#SBATCH --time=0:30:00\n"
    output += "#SBATCH --account=s441\n"
    output += "\n"
    output += "source ${MODULESHOME}/init/sh;\n"
    output += "module unload PrgEnv-cray\n"
    output += "module load cudatoolkit PrgEnv-gnu\n"
    output += "module list\n"
    output += "cd $SLURM_SUBMIT_DIR \n"
    output += "\n"
    output += "date\n"
    for exe in all_exe:
        output += "aprun -b -n 1 -N 1 -d 8 make -j 16 %s &\n" % exe
    ...
</code>

=== Step 3: Run the script tune.py ===
The script takes as arguments the block sizes you want to add to libcusmm. For example, if your system contains blocks of size 5 and 8, type:
<code>
$ ./tune.py 5 8
Found 23 parameter sets for 5x5x5
Found 31 parameter sets for 5x5x8
Found 107 parameter sets for 5x8x5
Found 171 parameter sets for 5x8x8
Found 75 parameter sets for 8x5x5
Found 107 parameter sets for 8x5x8
Found 248 parameter sets for 8x8x5
Found 424 parameter sets for 8x8x8
</code>

The script creates a directory for each combination of the block sizes:
<code>
$ ls -d tune_*
tune_5x5x5  tune_5x5x8  tune_5x8x5  tune_5x8x8  tune_8x5x5  tune_8x5x8  tune_8x8x5  tune_8x8x8
</code>

Each directory contains a number of files:
<code>
$ ls -1 tune_8x8x8/
Makefile
tune_8x8x8_exe0_main.cu
tune_8x8x8_exe0_part0.cu
tune_8x8x8_exe0_part1.cu
tune_8x8x8_exe0_part2.cu
tune_8x8x8_exe0_part3.cu
tune_8x8x8_exe0_part4.cu
tune_8x8x8.job
</code>
For each possible parameter set a //launcher// is generated. A launcher is a small snippet of C code, which launches the kernel by using the CUDA-specific ''%%<<< >>>%%'' notation. It also instantiates the C++ template which contains the actual kernel code.
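
As a rough illustration, generating such a launcher amounts to filling a parameter set into a C code template. This is a minimal sketch only: the kernel name, template parameters, and launch configuration below are hypothetical, not the actual ''tune.py'' code.
<code python>
# Hypothetical sketch of launcher generation; names and signatures
# are illustrative only.
def gen_launcher(m, n, k, threads, grouping, minblocks):
    """Return the C source of one launcher for a given parameter set."""
    args = (m, n, k, threads, grouping, minblocks)
    return """
int launch_tiny_%dx%dx%d_%d_%d_%d(int *param_stack, int stack_size,
                                  cudaStream_t stream,
                                  double *a, double *b, double *c) {
    // instantiate the C++ kernel template for this parameter set
    // and launch it with the CUDA-specific <<< >>> notation
    cusmm_dnt_tiny<%d, %d, %d, %d, %d, %d>
        <<< stack_size / %d, %d, 0, stream >>>
        (param_stack, stack_size, a, b, c);
    return 0;
}
""" % (args + args + (grouping, threads))
</code>
Calling ''print(gen_launcher(8, 8, 8, 128, 16, 1))'' would then print one such snippet; ''tune.py'' writes many of these into the part files described below.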

In order to parallelize the benchmarking, the launchers are distributed over multiple executables.
Currently, up to 10000 launchers are benchmarked by one //executable//. Each executable is linked together from several ''tune_*_part???.o'' files and a ''tune_*_main.o''. Each part file contains up to 100 launchers. This allows the compilation to be parallelized over multiple CPU cores.

=== Step 4: Adapt submit.py for your Environment ===
The script ''submit.py'' was written for the SLURM batch system as used, e.g., by Cray supercomputers. If your computer runs a different batch system, you have to adapt ''submit.py'' accordingly.
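
Typically only the batch-system-specific commands need to change. A minimal sketch, assuming PBS as the alternative system; the function below is illustrative, not the actual ''submit.py'' code:
<code python>
# Illustrative only: isolate the batch-system-specific command so that
# porting to another scheduler means swapping a single call.
from subprocess import check_call

def submit_job(jobfile, batch_system="slurm"):
    """Submit a job file with the site's batch system."""
    if batch_system == "slurm":
        check_call(["sbatch", jobfile])
    elif batch_system == "pbs":
        check_call(["qsub", jobfile])
    else:
        raise ValueError("unknown batch system: " + batch_system)
</code>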

=== Step 5: Submit Jobs ===
Each tune directory contains a job file.
Since there might be many tune directories, the convenience script ''submit.py'' can be used. It goes through all ''tune_*'' directories and checks whether the job has already been submitted or run. For this, the script calls ''squeue'' in the background and searches for ''slurm-*.out'' files.
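
The skip logic boils down to two checks, roughly as in this sketch (hypothetical code, not the actual ''submit.py''):
<code python>
# Illustrative sketch of the skip check described above.
from glob import glob
from subprocess import check_output

def already_done(tune_dir, queued_names):
    """True if this tune directory already ran or is still queued."""
    if glob(tune_dir + "/slurm-*.out"):
        return True                    # output file exists: job already ran
    return tune_dir in queued_names    # job name still in the queue

queued_names = check_output(["squeue", "--format=%j"]).decode()
for d in sorted(glob("tune_*")):
    if not already_done(d, queued_names):
        print('%20s: Would submit, run with "doit!"' % d)
</code>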

When ''submit.py'' is called without arguments, it just lists the jobs that could be submitted:
<code>
$ ./submit.py
          tune_5x5x5: Would submit, run with "doit!"
          tune_5x5x8: Would submit, run with "doit!"
          tune_5x8x5: Would submit, run with "doit!"
          tune_5x8x8: Would submit, run with "doit!"
          tune_8x5x5: Would submit, run with "doit!"
          tune_8x5x8: Would submit, run with "doit!"
          tune_8x8x5: Would submit, run with "doit!"
          tune_8x8x8: Would submit, run with "doit!"
Number of jobs submitted: 8
</code>

Only when ''submit.py'' is called with ''doit!'' as its first argument will it actually submit the jobs:
<code>
$ ./submit.py doit!
          tune_5x5x5: Submitting
Submitted batch job 277987
          tune_5x5x8: Submitting
Submitted batch job 277988
          tune_5x8x5: Submitting
Submitted batch job 277989
          tune_5x8x8: Submitting
Submitted batch job 277990
          tune_8x5x5: Submitting
Submitted batch job 277991
          tune_8x5x8: Submitting
Submitted batch job 277992
          tune_8x8x5: Submitting
Submitted batch job 277993
          tune_8x8x8: Submitting
Submitted batch job 277994
Number of jobs submitted: 8
</code>

=== Step 6: Collect Results ===
Run ''collect.py'' to parse all log files and to determine the best kernel for each block size:
<code>
$ ./collect.py
Reading: tune_5x5x5/tune_5x5x5_exe0.log
Reading: tune_5x5x8/tune_5x5x8_exe0.log
Reading: tune_5x8x5/tune_5x8x5_exe0.log
Reading: tune_5x8x8/tune_5x8x8_exe0.log
Reading: tune_8x5x5/tune_8x5x5_exe0.log
Reading: tune_8x5x8/tune_8x5x8_exe0.log
Reading: tune_8x8x5/tune_8x8x5_exe0.log
Reading: tune_8x8x8/tune_8x8x8_exe0.log
Kernel_dnt_tiny(m=5, n=5, k=5, split_thread=32, threads=64, grouping=16, minblocks=1) , # 27.9623 GFlops
Kernel_dnt_tiny(m=5, n=5, k=8, split_thread=32, threads=96, grouping=16, minblocks=1) , # 37.8978 GFlops
Kernel_dnt_medium(m=5, n=8, k=5, tile_m=1, tile_n=1, threads=96, grouping=16, minblocks=8) , # 32.9231 GFlops
Kernel_dnt_tiny(m=5, n=8, k=8, split_thread=32, threads=96, grouping=16, minblocks=1) , # 47.0366 GFlops
Kernel_dnt_medium(m=8, n=5, k=5, tile_m=1, tile_n=1, threads=96, grouping=16, minblocks=12) , # 33.1999 GFlops
Kernel_dnt_medium(m=8, n=5, k=8, tile_m=1, tile_n=1, threads=96, grouping=16, minblocks=12) , # 49.3499 GFlops
Kernel_dnt_tiny(m=8, n=8, k=5, split_thread=32, threads=96, grouping=16, minblocks=1) , # 62.8469 GFlops
Kernel_dnt_tiny(m=8, n=8, k=8, split_thread=32, threads=128, grouping=16, minblocks=1) , # 90.7763 GFlops
</code>
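
Conceptually, the collection step is a maximum over the measured timings. A minimal sketch, assuming log lines shaped like the output above (illustrative, not the actual ''collect.py''):
<code python>
# Illustrative sketch of the collection step: keep, per (m, n, k), the
# kernel line with the highest measured GFlops.
import re
from glob import glob

best = {}  # (m, n, k) -> (gflops, kernel line)
for logfile in sorted(glob("tune_*/tune_*_exe*.log")):
    print("Reading: " + logfile)
    with open(logfile) as f:
        for line in f:
            match = re.search(r"m=(\d+), n=(\d+), k=(\d+).*# *([0-9.]+) GFlops", line)
            if match:
                m, n, k = (int(x) for x in match.groups()[:3])
                gflops = float(match.group(4))
                if gflops > best.get((m, n, k), (0.0, ""))[0]:
                    best[(m, n, k)] = (gflops, line.strip())

for key in sorted(best):
    print(best[key][1])
</code>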