====== Howto Optimize Cuda Kernels for Libcusmm ======
**Python version required:** python3.6

If you are about to autotune parameters for a new GPU (i.e. a GPU for which there are no autotuned parameters yet), please first follow [[https://github.com/cp2k/dbcsr/tree/develop/src/acc/libsmm_acc/libcusmm#adding-support-for-a-new-gpu-card|the instructions for a new GPU]].

=== Step 1: Go to the libcusmm directory ===
<code>
$ cd dbcsr/src/acc/libsmm_acc/libcusmm
</code>
  
=== Step 2: Adapt tune_setup.py to your environment ===
The ''tune_setup.py'' script generates job files. You have to adapt the script to the environment of your supercomputer and your personal settings.
<code python>
...
def gen_jobfile(outdir, m, n, k):
    # Generate a SLURM job file for one (m, n, k) blocksize.
    t = "/tune_%dx%dx%d" % (m, n, k)
    all_exe_src = [os.path.basename(fn) for fn in glob(outdir + t + "_*_main.cu")]
    all_exe = sorted([fn.replace("_main.cu", "") for fn in all_exe_src])

    output = "#!/bin/bash -l\n"
    output += "#SBATCH --nodes=%d\n" % len(all_exe)
    output += "#SBATCH --time=0:30:00\n"
    output += "#SBATCH --account=s238\n"
    output += "#SBATCH --partition=normal\n"
    output += "#SBATCH --constraint=gpu\n"
    output += "\n"
    output += "source ${MODULESHOME}/init/sh;\n"
    output += "module load daint-gpu\n"
    output += "module unload PrgEnv-cray\n"
    output += "module load PrgEnv-gnu/6.0.3\n"
    output += "module load cudatoolkit/8.0.54_2.2.8_ga620558-2.1\n"
    output += "module list\n"
    output += "export CRAY_CUDA_MPS=1\n"
    output += "cd $SLURM_SUBMIT_DIR\n"
    output += "\n"
    output += "date\n"
    for exe in all_exe:
        output += (
            "srun --nodes=1 --bcast=/tmp/${USER} --ntasks=1 --ntasks-per-node=1 --cpus-per-task=12 make -j 24 %s &\n"
            % exe)
    ...
...
</code>
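The parts that typically need adapting are the ''#SBATCH'' directives (account, partition, constraint, walltime) and the ''module'' commands. Below is a minimal sketch of one way to keep these site-specific settings in one place while editing the script; the helper and names are illustrative and not part of ''tune_setup.py''.
<code python>
# Illustrative only: collect the site-specific pieces of the job-file header
# in one place so they are easy to adapt for another machine.
SITE_SETTINGS = {
    "account": "s238",          # your compute project / account
    "partition": "normal",      # partition (queue) of your cluster
    "constraint": "gpu",        # node constraint, if your cluster uses one
    "time": "0:30:00",          # walltime per tuning job
    "modules": ["daint-gpu",
                "PrgEnv-gnu/6.0.3",
                "cudatoolkit/8.0.54_2.2.8_ga620558-2.1"],
}

def jobfile_header(num_nodes, settings=SITE_SETTINGS):
    """Build the #SBATCH header of a tuning job from the site settings."""
    lines = [
        "#!/bin/bash -l",
        "#SBATCH --nodes=%d" % num_nodes,
        "#SBATCH --time=%s" % settings["time"],
        "#SBATCH --account=%s" % settings["account"],
        "#SBATCH --partition=%s" % settings["partition"],
        "#SBATCH --constraint=%s" % settings["constraint"],
        "",
    ]
    lines += ["module load %s" % m for m in settings["modules"]]
    return "\n".join(lines) + "\n"
</code>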
  
=== Step 3: Run the script tune_setup.py ===
Specify which GPU you are autotuning for by passing the appropriate ''parameters_GPU.json'' file as an argument with ''-p''.
In addition, the script takes as arguments the blocksizes you want to add to libcusmm. For example, if the system you want to autotune for contains blocks of size 5 and 8, run:
<code>
$ ./tune_setup.py 5 8 -p parameters_P100.json
Found 23 parameter sets for 5x5x5
Found 31 parameter sets for 5x5x8
...
tune_8x8x8.job
</code>
For each possible parameter set a //launcher// is generated. A launcher is a small snippet of C code which launches the kernel using the CUDA-specific ''%%<<< >>>%%'' notation. It also instantiates the C++ template which contains the actual kernel code.
  
In order to parallelize the benchmarking, the launchers are distributed over multiple executables.
Currently, up to 10,000 launchers are benchmarked by one //executable//. Each executable is linked together from several ''tune_*_part???.o'' files and a ''tune_*_main.o''. Each part file contains up to 100 launchers. This allows the compilation to be parallelized over multiple CPU cores, as sketched below.
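
As an illustration of how that grouping works, here is a minimal sketch; the helper below is hypothetical, the real grouping is done by ''tune_setup.py'' when it writes the part files.
<code python>
# Illustrative only: split a list of generated launchers into chunks of at
# most 100, one chunk per tune_*_part???.cu file, so the parts can be
# compiled in parallel.
def chunk_launchers(launchers, chunk_size=100):
    parts = []
    for i in range(0, len(launchers), chunk_size):
        parts.append(launchers[i:i + chunk_size])
    return parts

# Example: 250 launchers end up in 3 part files (100 + 100 + 50 launchers).
parts = chunk_launchers(["launcher_%d" % i for i in range(250)])
assert [len(p) for p in parts] == [100, 100, 50]
</code>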
  
=== Step 4: Adapt tune_submit.py to your environment ===
The script ''tune_submit.py'' was written for the SLURM batch system as used, e.g., by Cray supercomputers. If your computer runs a different batch system, you have to adapt ''tune_submit.py'' accordingly.
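
For example, the submission command has to be swapped for the one your scheduler uses. The sketch below shows the kind of change involved; the function and command names are placeholders, not code from ''tune_submit.py''.
<code python>
import subprocess

# Illustrative only: submit one tuning job directory with a scheduler other
# than SLURM by replacing "sbatch" with your scheduler's submit command,
# e.g. ["qsub", job_file] for PBS/Torque or ["bsub", ...] for LSF.
def submit_job(job_dir, job_file, submit_cmd="sbatch"):
    subprocess.check_call([submit_cmd, job_file], cwd=job_dir)
</code>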
  
=== Step 5: Submit Jobs ===
Each tune-directory contains a job file.
Since there might be many tune-directories, the convenience script ''tune_submit.py'' can be used to submit jobs. It will go through all the ''tune_*'' directories and check whether their jobs have already been submitted or run. To do so, the script calls ''squeue'' in the background and searches for ''slurm-*.out'' files.
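
A simplified sketch of this check follows; it is not the actual implementation in ''tune_submit.py'', and the job-name assumption below may differ from what the script really does.
<code python>
import subprocess
from glob import glob

# Simplified illustration: a tune_* directory counts as "already handled" if a
# slurm-*.out file exists in it, or if a job with that name shows up in squeue
# (assuming the job name contains the directory name).
def already_submitted_or_run(tune_dir, squeue_output):
    if glob(tune_dir + "/slurm-*.out"):
        return True
    return tune_dir in squeue_output

squeue_output = subprocess.check_output(["squeue", "--format=%j"]).decode()
print(already_submitted_or_run("tune_5x5x5", squeue_output))
</code>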
  
When ''tune_submit.py'' is called without arguments, it will just list the jobs that could be submitted:
<code>
$ ./tune_submit.py
          tune_5x5x5: Would submit, run with "doit!"
          tune_5x5x8: Would submit, run with "doit!"
...
</code>
  
Only when ''tune_submit.py'' is called with ''doit!'' as its first argument will it actually submit jobs:
<code>
$ ./tune_submit.py doit!
          tune_5x5x5: Submitting
Submitted batch job 277987
...
</code>
  
=== Step 6: Collect Results ===
Run ''tune_collect.py'' to parse all log files and determine the best kernel for each blocksize:
<code>
$ ./tune_collect.py
Reading: tune_5x5x5/tune_5x5x5_exe0.log
Reading: tune_5x5x8/tune_5x5x8_exe0.log
...
Kernel_dnt_tiny(m=8, n=8, k=5, split_thread=32, threads=96, grouping=16, minblocks=1) , # 62.8469 GFlops
Kernel_dnt_tiny(m=8, n=8, k=8, split_thread=32, threads=128, grouping=16, minblocks=1) , # 90.7763 GFlops

Wrote parameters.json
</code>

The file ''parameters.json'' now contains the newly autotuned parameters.
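
Conceptually, the collection step keeps, for every (m, n, k) blocksize, the parameter set with the highest measured performance. A toy sketch of that selection follows; the data and field names are made up for illustration and do not reflect the actual log format.
<code python>
# Toy illustration: given measured performances, keep the fastest parameter
# set for each (m, n, k) blocksize.
measurements = [
    {"m": 5, "n": 5, "k": 5, "threads": 64, "gflops": 43.2},
    {"m": 5, "n": 5, "k": 5, "threads": 96, "gflops": 51.7},
    {"m": 8, "n": 8, "k": 8, "threads": 128, "gflops": 90.8},
]

best = {}
for rec in measurements:
    key = (rec["m"], rec["n"], rec["k"])
    if key not in best or rec["gflops"] > best[key]["gflops"]:
        best[key] = rec

for key, rec in sorted(best.items()):
    print(key, rec["gflops"], "GFlop/s")
</code>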

=== Step 7: Merge new parameters with the original parameter file ===
Run ''tune_merge.py'' to merge the new parameters with the original ones:
<code>
$ ./tune_merge.py
Merging parameters.json with parameters_P100.json
Wrote parameters.new.json
</code>

The file ''parameters.new.json'' can now be used as a parameter file. Rename it to ''parameters_GPU.json'', with the appropriate ''GPU''.
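
The merge conceptually amounts to letting the freshly autotuned entries override the corresponding entries of the original file while keeping everything else. A simplified sketch of that idea follows; the file layout assumed here (a list of records with ''m'', ''n'', ''k'' keys) is an assumption, not necessarily the exact format handled by ''tune_merge.py''.
<code python>
import json

# Simplified illustration: entries from the new file take precedence for the
# (m, n, k) triples that were just autotuned; all other entries are kept.
def merge_parameters(old_file, new_file, out_file):
    with open(old_file) as f:
        old = json.load(f)
    with open(new_file) as f:
        new = json.load(f)
    merged = {(p["m"], p["n"], p["k"]): p for p in old}
    merged.update({(p["m"], p["n"], p["k"]): p for p in new})
    with open(out_file, "w") as f:
        json.dump([merged[k] for k in sorted(merged)], f, indent=2)

# merge_parameters("parameters_P100.json", "parameters.json", "parameters.new.json")
</code>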

=== Step 8: Contribute parameters to the community ===

**Contribute new optimal parameters**

Submit a pull request updating the appropriate ''parameters_GPU.json'' file to the [[https://github.com/cp2k/dbcsr|DBCSR repository]].

**Contribute autotuning data**

See the [[https://github.com/cp2k/dbcsr-data#contributing|instructions]] in DBCSR's [[https://github.com/cp2k/dbcsr-data|data repository]].