howto:libcusmm [2014/03/28 14:28]
— (current)
Line 1: Line 1:
-====== Howto Optimize Cuda Kernels for Libcusmm ====== 
-=== Step 1: Go to the libcusmm directory ===
-$ cd $CP2K_ROOT/src/dbcsr/cuda/libcusmm 
-=== Step 2: Run the script === 
-The script takes as arguments the blocksizes you want to add to libcusmm. For example, if your system contains blocks of size 5 and 8, type:
-$ ./ 5 8 
-Found 23 parameter sets for 5x5x5 
-Found 31 parameter sets for 5x5x8 
-Found 107 parameter sets for 5x8x5 
-Found 171 parameter sets for 5x8x8 
-Found 75 parameter sets for 8x5x5 
-Found 107 parameter sets for 8x5x8 
-Found 248 parameter sets for 8x8x5 
-Found 424 parameter sets for 8x8x8 
-The script will create a directory for each combination of the blocksizes: 
-$ ls -d tune_* 
-tune_5x5x5  tune_5x5x8  tune_5x8x5  tune_5x8x8  tune_8x5x5  tune_8x5x8  tune_8x8x5  tune_8x8x8 
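The directory list above corresponds to every ordered (m, n, k) triple that can be formed from the given blocksizes. A minimal sketch of that enumeration follows; the function name ''tune_directories'' is hypothetical and is not taken from the libcusmm scripts.

```python
from itertools import product


def tune_directories(blocksizes):
    """Return one directory name per ordered (m, n, k) triple of blocksizes.

    Hypothetical helper: it only mirrors the naming pattern seen in the
    listing above, not the actual libcusmm implementation.
    """
    sizes = sorted(set(blocksizes))
    return ["tune_%dx%dx%d" % (m, n, k) for m, n, k in product(sizes, repeat=3)]


dirs = tune_directories([5, 8])
print(len(dirs))
print("  ".join(dirs))
```

With two blocksizes this yields 2^3 = 8 directories, so the number of tuning jobs grows cubically with the number of distinct blocksizes.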
-Each directory contains a number of files: 
-$ ls -1 tune_8x8x8/ 
-For each possible parameter set a //launcher// is generated. A launcher is a small snippet of C code, which launches the kernel using the CUDA-specific ''<<< >>>''-notation. It also instantiates the C++ template which contains the actual kernel code.
-In order to parallelize the compilation and the benchmarking, the launchers are distributed over several files.
-Currently, up to 10000 launchers are compiled into one //executable//. Each executable is linked together from several //parts// and a ''tune_*_main.o''. Each part contains up to 100 launchers and is compiled into a separate object file ''tune_*_part???.o''.
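The grouping described above can be sketched as two levels of chunking: launchers into parts of at most 100, and parts into executables of at most 10000 launchers. The names ''chunk'' and ''plan_build'' are hypothetical, and the sketch assumes launchers are assigned in order, which the text does not state.

```python
LAUNCHERS_PER_PART = 100   # from the text: up to 100 launchers per part
LAUNCHERS_PER_EXE = 10000  # from the text: up to 10000 launchers per executable
PARTS_PER_EXE = LAUNCHERS_PER_EXE // LAUNCHERS_PER_PART


def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def plan_build(num_launchers):
    """Return a list of executables; each is a list of parts; each part
    is a list of launcher indices. Assumes in-order assignment."""
    launchers = list(range(num_launchers))
    parts = chunk(launchers, LAUNCHERS_PER_PART)
    return chunk(parts, PARTS_PER_EXE)


# 424 is the number of parameter sets reported for 8x8x8 above.
plan = plan_build(424)
print(len(plan), "executable(s)")
print([len(part) for part in plan[0]], "launchers per part")
```

Under these assumptions the 424 parameter sets for 8x8x8 fit into a single executable built from five parts.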
-=== Step 3: Submit Jobs === 
-Each tune-directory contains a job file. 
-Since there might be many tune-directories, the convenience script '''' can be used. It goes through all the ''tune_*''-directories and checks whether each one has already been submitted or run. For this, the script calls ''squeue'' in the background and searches for ''slurm-*.out'' files.
-When '''' is called without arguments it will just list the jobs that could be submitted: 
-$ ./  
-          tune_5x5x5: Would submit, run with "doit!" 
-          tune_5x5x8: Would submit, run with "doit!" 
-          tune_5x8x5: Would submit, run with "doit!" 
-          tune_5x8x8: Would submit, run with "doit!" 
-          tune_8x5x5: Would submit, run with "doit!" 
-          tune_8x5x8: Would submit, run with "doit!" 
-          tune_8x8x5: Would submit, run with "doit!" 
-          tune_8x8x8: Would submit, run with "doit!" 
-Number of jobs submitted: 8 
-Only when '''' is called with ''doit!'' as its first argument will it actually submit the jobs:
-$ ./ doit! 
-          tune_5x5x5: Submitting 
-Submitted batch job 277987 
-          tune_5x5x8: Submitting 
-Submitted batch job 277988 
-          tune_5x8x5: Submitting 
-Submitted batch job 277989 
-          tune_5x8x8: Submitting 
-Submitted batch job 277990 
-          tune_8x5x5: Submitting 
-Submitted batch job 277991 
-          tune_8x5x8: Submitting 
-Submitted batch job 277992 
-          tune_8x8x5: Submitting 
-Submitted batch job 277993 
-          tune_8x8x8: Submitting 
-Submitted batch job 277994 
-Number of jobs submitted: 8 
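The decision logic behind the two runs above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual script: the function names and status labels are hypothetical, the set ''queued_names'' stands in for whatever the real script extracts from ''squeue'', and it is assumed that a queued job can be matched to its directory by name. Nothing is submitted here; the sketch only reports what would happen.

```python
import glob
import os


def job_status(tune_dir, queued_names):
    """Classify a tune directory as 'queued', 'finished' or 'pending'.

    Assumption: a queued job carries the directory's name, and a finished
    or running job has left a slurm-*.out file in the directory.
    """
    name = os.path.basename(os.path.normpath(tune_dir))
    if name in queued_names:
        return "queued"
    if glob.glob(os.path.join(tune_dir, "slurm-*.out")):
        return "finished"
    return "pending"


def plan_submissions(tune_dirs, queued_names, argv):
    """Return (directory name, action) pairs; only 'doit!' as the first
    argument turns a pending directory into an actual submission."""
    doit = len(argv) > 0 and argv[0] == "doit!"
    report = []
    for tune_dir in sorted(tune_dirs):
        name = os.path.basename(os.path.normpath(tune_dir))
        status = job_status(tune_dir, queued_names)
        if status != "pending":
            action = "skip (%s)" % status
        elif doit:
            action = "submit"
        else:
            action = "would submit"
        report.append((name, action))
    return report
```

Keeping the default a dry run means that calling the script by accident never floods the queue; the explicit ''doit!'' argument acts as a confirmation.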
-=== Step 4: Collect Results === 
-Run '''' to parse all log files and to determine the best kernel for each blocksize: 
-$ ./ 
-Reading: tune_5x5x5/tune_5x5x5_exe0.log 
-Reading: tune_5x5x8/tune_5x5x8_exe0.log 
-Reading: tune_5x8x5/tune_5x8x5_exe0.log 
-Reading: tune_5x8x8/tune_5x8x8_exe0.log 
-Reading: tune_8x5x5/tune_8x5x5_exe0.log 
-Reading: tune_8x5x8/tune_8x5x8_exe0.log 
-Reading: tune_8x8x5/tune_8x8x5_exe0.log 
-Reading: tune_8x8x8/tune_8x8x8_exe0.log 
-Kernel_dnt_tiny(m=5, n=5, k=5, split_thread=32, threads=64, grouping=16, minblocks=1) , # 27.9623 GFlops  
-Kernel_dnt_tiny(m=5, n=5, k=8, split_thread=32, threads=96, grouping=16, minblocks=1) , # 37.8978 GFlops 
-Kernel_dnt_medium(m=5, n=8, k=5, tile_m=1, tile_n=1, threads=96, grouping=16, minblocks=8) , # 32.9231 GFlops  
-Kernel_dnt_tiny(m=5, n=8, k=8, split_thread=32, threads=96, grouping=16, minblocks=1) , # 47.0366 GFlops 
-Kernel_dnt_medium(m=8, n=5, k=5, tile_m=1, tile_n=1, threads=96, grouping=16, minblocks=12) , # 33.1999 GFlops  
-Kernel_dnt_medium(m=8, n=5, k=8, tile_m=1, tile_n=1, threads=96, grouping=16, minblocks=12) , # 49.3499 GFlops 
-Kernel_dnt_tiny(m=8, n=8, k=5, split_thread=32, threads=96, grouping=16, minblocks=1) , # 62.8469 GFlops  
-Kernel_dnt_tiny(m=8, n=8, k=8, split_thread=32, threads=128, grouping=16, minblocks=1) , # 90.7763 GFlops  
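Selecting the best kernel per blocksize amounts to keeping, for each (m, n, k), the parameter set with the highest measured GFlops. The sketch below parses lines in the summary format printed above; the format of the underlying ''*.log'' files is not shown in this page, so the sketch assumes its input already looks like these summary lines. The names ''parse_kernel_line'' and ''best_per_blocksize'' are hypothetical.

```python
import re

# Matches e.g. "Kernel_dnt_tiny(m=5, n=5, k=5, ...) , # 27.9623 GFlops"
LINE_RE = re.compile(r"^(Kernel_\w+)\((.*?)\)\s*,\s*#\s*([0-9.]+)\s*GFlops")


def parse_kernel_line(line):
    """Return (kernel name, parameter dict, GFlops) or None if no match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    name, args, gflops = match.groups()
    params = {}
    for item in args.split(","):
        key, value = item.split("=")
        params[key.strip()] = int(value)
    return name, params, float(gflops)


def best_per_blocksize(lines):
    """Keep the fastest parsed kernel for every (m, n, k) triple."""
    best = {}
    for line in lines:
        parsed = parse_kernel_line(line)
        if parsed is None:
            continue  # ignore lines that are not kernel results
        params, gflops = parsed[1], parsed[2]
        key = (params["m"], params["n"], params["k"])
        if key not in best or gflops > best[key][2]:
            best[key] = parsed
    return best
```

Lines that do not match the pattern, such as the ''Reading: ...'' progress messages, are skipped rather than treated as errors.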
howto/libcusmm.1396016895.txt.gz · Last modified: 2020/08/21 10:15 (external edit)