====== Howto Optimize Cuda Kernels for Libcusmm ======
**Python version required:** python3.6

If you are about to autotune parameters for a new GPU (i.e. a GPU for which there are no autotuned parameters yet), please first follow [[https://github.com/cp2k/dbcsr/tree/develop/src/acc/libsmm_acc/libcusmm#adding-support-for-a-new-gpu-card|the instructions for a new GPU]].

=== Step 1: Go to the libcusmm directory ===
<code>
$ cd dbcsr/src/acc/libsmm_acc/libcusmm
</code>

=== Step 2: Adapt tune_setup.py to your environment ===
The ''tune_setup.py'' script generates job files. You have to adapt the script to the environment of your supercomputer and your personal settings.
<code python>
...
  def gen_jobfile(outdir, m, n, k):
      t = "/tune_%dx%dx%d" % (m, n, k)
      all_exe_src = [os.path.basename(fn) for fn in glob(outdir + t + "_*_main.cu")]
      all_exe = sorted([fn.replace("_main.cu", "") for fn in all_exe_src])

      output = "#!/bin/bash -l\n"
      output += "#SBATCH --nodes=%d\n" % len(all_exe)
      output += "#SBATCH --time=0:30:00\n"
      output += "#SBATCH --account=s238\n"
      output += "#SBATCH --partition=normal\n"
      output += "#SBATCH --constraint=gpu\n"
      output += "\n"
      output += "source ${MODULESHOME}/init/sh;\n"
      output += "module load daint-gpu\n"
      output += "module unload PrgEnv-cray\n"
      output += "module load PrgEnv-gnu/6.0.3\n"
      output += "module load cudatoolkit/8.0.54_2.2.8_ga620558-2.1\n"
      output += "module list\n"
      output += "export CRAY_CUDA_MPS=1\n"
      output += "cd $SLURM_SUBMIT_DIR\n"
      output += "\n"
      output += "date\n"
      for exe in all_exe:
          output += (
              "srun --nodes=1 --bcast=/tmp/${USER} --ntasks=1 --ntasks-per-node=1 --cpus-per-task=12 make -j 24 %s &\n" %
              exe)
   ...
...
</code>

=== Step 3: Run the script tune_setup.py ===
Specify which GPU you are autotuning for by passing the appropriate ''parameters_GPU.json'' file as an argument with ''-p''. In addition, the script takes as arguments the blocksizes you want to add to libcusmm. For example, if the system you want to autotune for contains blocks of size 5 and 8, run:
<code>
$ ./tune_setup.py 5 8 -p parameters_P100.json
Found 23 parameter sets for 5x5x5
Found 31 parameter sets for 5x5x8
Found 107 parameter sets for 5x8x5
Found 171 parameter sets for 5x8x8
Found 75 parameter sets for 8x5x5
Found 107 parameter sets for 8x5x8
Found 248 parameter sets for 8x8x5
Found 424 parameter sets for 8x8x8
</code>

The script will create a directory for each combination of the blocksizes:
<code>
$ ls -d tune_*
tune_5x5x5  tune_5x5x8  tune_5x8x5  tune_5x8x8  tune_8x5x5  tune_8x5x8  tune_8x8x5  tune_8x8x8
</code>

Each directory contains a number of files:
<code>
$ ls -1 tune_8x8x8/
Makefile
tune_8x8x8_exe0_main.cu
tune_8x8x8_exe0_part0.cu
tune_8x8x8_exe0_part1.cu
tune_8x8x8_exe0_part2.cu
tune_8x8x8_exe0_part3.cu
tune_8x8x8_exe0_part4.cu
tune_8x8x8.job
</code>
For each possible parameter set, a //launcher// is generated. A launcher is a small snippet of CUDA C++ code that launches the kernel using the CUDA-specific ''%%<<< >>>%%'' notation. It also instantiates the C++ template that contains the actual kernel code.
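
For illustration, the sketch below mimics how such a launcher could be generated from Python. The function name, the kernel template name ''cusmm_dnt_tiny'', and the exact signature are assumptions for this sketch and do not reproduce the real generated code:
<code python>
# Illustrative sketch only: tune_setup.py generates the real launchers, and
# their exact names, template parameters, and signatures differ.
def gen_launcher(m, n, k, threads, grouping, minblocks):
    """Return C++ source of one hypothetical launcher for a parameter set."""
    src = (
        "int launch_tiny_{m}x{n}x{k}(const int *param_stack, int stack_size,\n"
        "                            cudaStream_t stream, const double *a,\n"
        "                            const double *b, double *c) {{\n"
        "  // Instantiate the kernel template and launch it.\n"
        "  cusmm_dnt_tiny<{m}, {n}, {k}, {threads}, {grouping}, {minblocks}>\n"
        "      <<<(stack_size + {grouping} - 1) / {grouping}, {threads}, 0, stream>>>\n"
        "      (param_stack, stack_size, a, b, c);\n"
        "  return 0;\n"
        "}}\n"
    )
    return src.format(m=m, n=n, k=k, threads=threads,
                      grouping=grouping, minblocks=minblocks)
</code>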

To parallelize the benchmarking, the launchers are distributed over multiple executables. Currently, up to 10,000 launchers are benchmarked by one //executable//. Each executable is linked together from several ''tune_*_part???.o'' files and one ''tune_*_main.o''. Each part file contains up to 100 launchers, which allows the compilation to be parallelized over multiple CPU cores.
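
This amounts to a two-level chunking; here is a minimal sketch with the sizes quoted above (the actual implementation in ''tune_setup.py'' may differ):
<code python>
# Sketch of the two-level batching described above; the constants come from
# the text, the function itself is illustrative.
LAUNCHERS_PER_PART = 100    # launchers compiled into one tune_*_part???.o
LAUNCHERS_PER_EXE = 10000   # launchers benchmarked by one executable

def batch(launchers):
    """Yield (exe_index, part_index, chunk) for every chunk of launchers."""
    for e in range(0, len(launchers), LAUNCHERS_PER_EXE):
        exe_chunk = launchers[e:e + LAUNCHERS_PER_EXE]
        for p in range(0, len(exe_chunk), LAUNCHERS_PER_PART):
            yield (e // LAUNCHERS_PER_EXE,
                   p // LAUNCHERS_PER_PART,
                   exe_chunk[p:p + LAUNCHERS_PER_PART])
</code>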

=== Step 4: Adapt tune_submit.py to your environment ===
The script ''tune_submit.py'' was written for the SLURM batch system as used, e.g., by Cray supercomputers. If your computer runs a different batch system, you have to adapt ''tune_submit.py'' accordingly.

=== Step 5: Submit Jobs ===
Each tune directory contains a job file. Since there might be many tune directories, the convenience script ''tune_submit.py'' can be used to submit the jobs. It goes through all ''tune_*'' directories and checks whether each job has already been submitted or run. For this, the script calls ''squeue'' in the background and searches for ''slurm-*.out'' files.
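
A rough sketch of that check, assuming SLURM's ''squeue'' command and that the job name matches the tune directory name (illustrative only; the real ''tune_submit.py'' may differ in its details):
<code python>
import subprocess
from glob import glob
from os.path import basename

def already_submitted_or_run(tune_dir):
    """Heuristic for skipping a tune directory: a slurm-*.out file means the
    job already ran; a matching job name in squeue means it is still queued."""
    if glob(tune_dir + "/slurm-*.out"):
        return True
    jobnames = subprocess.check_output(["squeue", "--format=%j"]).decode()
    return basename(tune_dir) in jobnames
</code>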

When ''tune_submit.py'' is called without arguments, it will just list the jobs that could be submitted:
<code>
$ ./tune_submit.py
          tune_5x5x5: Would submit, run with "doit!"
          tune_5x5x8: Would submit, run with "doit!"
          tune_5x8x5: Would submit, run with "doit!"
          tune_5x8x8: Would submit, run with "doit!"
          tune_8x5x5: Would submit, run with "doit!"
          tune_8x5x8: Would submit, run with "doit!"
          tune_8x8x5: Would submit, run with "doit!"
          tune_8x8x8: Would submit, run with "doit!"
Number of jobs submitted: 8
</code>

Only when ''tune_submit.py'' is called with ''doit!'' as its first argument will it actually submit the jobs:
<code>
$ ./tune_submit.py doit!
          tune_5x5x5: Submitting
Submitted batch job 277987
          tune_5x5x8: Submitting
Submitted batch job 277988
          tune_5x8x5: Submitting
Submitted batch job 277989
          tune_5x8x8: Submitting
Submitted batch job 277990
          tune_8x5x5: Submitting
Submitted batch job 277991
          tune_8x5x8: Submitting
Submitted batch job 277992
          tune_8x8x5: Submitting
Submitted batch job 277993
          tune_8x8x8: Submitting
Submitted batch job 277994
Number of jobs submitted: 8
</code>

=== Step 6: Collect Results ===
Run ''tune_collect.py'' to parse all log files and determine the best kernel for each blocksize:
<code>
$ ./tune_collect.py
Reading: tune_5x5x5/tune_5x5x5_exe0.log
Reading: tune_5x5x8/tune_5x5x8_exe0.log
Reading: tune_5x8x5/tune_5x8x5_exe0.log
Reading: tune_5x8x8/tune_5x8x8_exe0.log
Reading: tune_8x5x5/tune_8x5x5_exe0.log
Reading: tune_8x5x8/tune_8x5x8_exe0.log
Reading: tune_8x8x5/tune_8x8x5_exe0.log
Reading: tune_8x8x8/tune_8x8x8_exe0.log
Kernel_dnt_tiny(m=5, n=5, k=5, split_thread=32, threads=64, grouping=16, minblocks=1) , # 27.9623 GFlops
Kernel_dnt_tiny(m=5, n=5, k=8, split_thread=32, threads=96, grouping=16, minblocks=1) , # 37.8978 GFlops
Kernel_dnt_medium(m=5, n=8, k=5, tile_m=1, tile_n=1, threads=96, grouping=16, minblocks=8) , # 32.9231 GFlops
Kernel_dnt_tiny(m=5, n=8, k=8, split_thread=32, threads=96, grouping=16, minblocks=1) , # 47.0366 GFlops
Kernel_dnt_medium(m=8, n=5, k=5, tile_m=1, tile_n=1, threads=96, grouping=16, minblocks=12) , # 33.1999 GFlops
Kernel_dnt_medium(m=8, n=5, k=8, tile_m=1, tile_n=1, threads=96, grouping=16, minblocks=12) , # 49.3499 GFlops
Kernel_dnt_tiny(m=8, n=8, k=5, split_thread=32, threads=96, grouping=16, minblocks=1) , # 62.8469 GFlops
Kernel_dnt_tiny(m=8, n=8, k=8, split_thread=32, threads=128, grouping=16, minblocks=1) , # 90.7763 GFlops

Wrote parameters.json
</code>

The file ''parameters.json'' now contains the newly autotuned parameters.
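
Conceptually, the collection step selects, for each blocksize, the parameter set with the highest measured performance. A minimal sketch of that selection, with the log format assumed from the output above (the real ''tune_collect.py'' may differ):
<code python>
import re

def best_kernel(log_lines):
    """Return (gflops, line) for the fastest kernel found in one log.

    Assumes measurement lines end in something like '... 90.7763 GFlops'.
    """
    best = (0.0, None)
    for line in log_lines:
        match = re.search(r"(\d+(?:\.\d+)?)\s*GFlops", line)
        if match:
            gflops = float(match.group(1))
            if gflops > best[0]:
                best = (gflops, line.strip())
    return best
</code>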

=== Step 7: Merge new parameters with original parameter-file ===
Run ''tune_merge.py'' to merge the new parameters with the original ones:
<code>
$ ./tune_merge.py
Merging parameters.json with parameters_P100.json
Wrote parameters.new.json
</code>

The file ''parameters.new.json'' can now be used as a parameter file. Rename it to ''parameters_GPU.json'', with the appropriate ''GPU''.
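
The merge itself is conceptually a dictionary update in which newly autotuned (m, n, k) entries override existing ones. A hedged sketch, assuming the parameter files hold a JSON list of records with ''m'', ''n'', ''k'' keys (the real ''tune_merge.py'' may store them differently):
<code python>
import json

def merge(original_file, new_file, out_file):
    """Merge new autotuned parameters into an existing parameter file;
    new (m, n, k) entries take precedence over old ones."""
    with open(original_file) as f:
        merged = {(p["m"], p["n"], p["k"]): p for p in json.load(f)}
    with open(new_file) as f:
        merged.update({(p["m"], p["n"], p["k"]): p for p in json.load(f)})
    records = sorted(merged.values(), key=lambda p: (p["m"], p["n"], p["k"]))
    with open(out_file, "w") as f:
        json.dump(records, f, indent=2)

merge("parameters_P100.json", "parameters.json", "parameters.new.json")
</code>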

=== Step 8: Contribute parameters to the community ===

**Contribute new optimal parameters**

Submit a pull request updating the appropriate ''parameters_GPU.json'' file to the [[https://github.com/cp2k/dbcsr|DBCSR repository]].

**Contribute autotuning data**

See the [[https://github.com/cp2k/dbcsr-data#contributing|instructions]] in DBCSR's [[https://github.com/cp2k/dbcsr-data|data repository]].
  