Page revisions: 2014/03/28 14:27 (oschuett) to 2019/04/09 12:45 (alazzaro, page removed)
====== Howto Optimize Cuda Kernels for Libcusmm ======
**Python version required:** python3.6

If you are about to autotune parameters for a new GPU (i.e. a GPU for which there are no autotuned parameters yet), please first follow [[https://github.com/cp2k/dbcsr/tree/develop/src/acc/libsmm_acc/libcusmm#adding-support-for-a-new-gpu-card|the instructions for a new GPU]].

=== Step 1: Go to the libcusmm directory ===
<code>
$ cd dbcsr/src/acc/libsmm_acc/libcusmm
</code>
=== Step 2: Adapt tune_setup.py to your environment ===
The ''tune_setup.py'' script generates job files. You have to adapt the script to the environment of your supercomputer and your personal settings.
<code python>
...
  def gen_jobfile(outdir, m, n, k):
      t = "/tune_%dx%dx%d" % (m, n, k)
      all_exe_src = [os.path.basename(fn) for fn in glob(outdir + t + "_*_main.cu")]
      all_exe = sorted([fn.replace("_main.cu", "") for fn in all_exe_src])

      output = "#!/bin/bash -l\n"
      output += "#SBATCH --nodes=%d\n" % len(all_exe)
      output += "#SBATCH --time=0:30:00\n"
      output += "#SBATCH --account=s238\n"
      output += "#SBATCH --partition=normal\n"
      output += "#SBATCH --constraint=gpu\n"
      output += "\n"
      output += "source ${MODULESHOME}/init/sh;\n"
      output += "module load daint-gpu\n"
      output += "module unload PrgEnv-cray\n"
      output += "module load PrgEnv-gnu/6.0.3\n"
      output += "module load cudatoolkit/8.0.54_2.2.8_ga620558-2.1\n"
      output += "module list\n"
      output += "export CRAY_CUDA_MPS=1\n"
      output += "cd $SLURM_SUBMIT_DIR\n"
      output += "\n"
      output += "date\n"
      for exe in all_exe:
          output += (
              "srun --nodes=1 --bcast=/tmp/${USER} --ntasks=1 --ntasks-per-node=1 --cpus-per-task=12 make -j 24 %s &\n"
              % exe)
   ...
...
</code>
  
=== Step 3: Run the script tune_setup.py ===
Specify which GPU you are autotuning for by passing the appropriate ''parameters_GPU.json'' file as an argument with ''-p''.
In addition, the script takes as arguments the blocksizes you want to add to libcusmm. For example, if the system you want to autotune for contains blocks of size 5 and 8, run:
<code>
$ ./tune_setup.py 5 8 -p parameters_P100.json
Found 23 parameter sets for 5x5x5
Found 31 parameter sets for 5x5x8
tune_8x8x8.job
</code>
For each possible parameter set a //launcher// is generated. A launcher is a small snippet of C code, which launches the kernel using the CUDA-specific ''%%<<< >>>%%'' notation. It also instantiates the C++ template which contains the actual kernel code.
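To make the idea concrete, here is a sketch in Python of what generating such a launcher snippet could look like. The function name, template name, and template parameters below are hypothetical; the real launcher templates are produced by ''tune_setup.py''.

```python
# Illustrative sketch only: the real launcher snippets are generated by
# tune_setup.py, and the template/parameter names below are hypothetical.

def gen_launcher(m, n, k, threads, grouping, minblocks):
    """Render a C launcher snippet that starts a kernel via <<< >>>."""
    return (
        "int launch_tiny_%dx%dx%d(param_t *params, int stack_size, cudaStream_t stream) {\n"
        "  cusmm_dnt_tiny<%d, %d, %d, %d, %d>\n"
        "      <<<(stack_size + %d - 1) / %d, %d, 0, stream>>>(params, stack_size);\n"
        "  return 0;\n"
        "}\n"
    ) % (m, n, k, m, n, k, grouping, minblocks, grouping, grouping, threads)

snippet = gen_launcher(5, 5, 8, threads=96, grouping=16, minblocks=1)
print(snippet)
```

Each generated launcher is just such a string written to a ''.cu'' file, so that nvcc compiles one concrete template instantiation per parameter set.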
  
In order to parallelize the benchmarking, the launchers are distributed over multiple executables.
Currently, up to 10'000 launchers are benchmarked by one //executable//. Each executable is linked together from several ''tune_*_part???.o'' files and a ''tune_*_main.o''. Each part file contains up to 100 launchers. This allows the compilation to be parallelized over multiple CPU cores.
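The grouping into parts and executables can be sketched as follows. The ''chunk'' helper is illustrative; the limits of 100 launchers per part and 10'000 per executable are taken from the text above.

```python
# Sketch of the grouping described above; `chunk` is an illustrative helper.

def chunk(seq, size):
    """Split a sequence into consecutive chunks of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

launchers = ["launcher_%04d" % i for i in range(250)]

parts = chunk(launchers, 100)          # compiled to tune_*_part???.o files
executables = chunk(launchers, 10000)  # all 250 fit into a single executable
```

With 250 launchers this yields three part files (100 + 100 + 50) but only one executable, so three compile jobs can run in parallel.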
  
=== Step 4: Adapt tune_submit.py to your environment ===
The script ''tune_submit.py'' was written for the Slurm batch system as used, e.g., by Cray supercomputers. If your computer runs a different batch system, you have to adapt ''tune_submit.py'' accordingly.

=== Step 5: Submit Jobs ===
Each tune-directory contains a job file.
Since there might be many tune-directories, the convenience script ''tune_submit.py'' can be used to submit jobs. It will go through all the ''tune_*''-directories and check whether their jobs have already been submitted or run. For this, the script calls ''squeue'' in the background and searches for ''slurm-*.out'' files.
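That check can be approximated by a short sketch. The function name is hypothetical, and ''squeue_listing'' stands in for the text output of ''squeue'', which the real script obtains itself.

```python
# Illustrative sketch of the submitted/run check; job_state is hypothetical.
import glob
import os

def job_state(tune_dir, squeue_listing):
    """Classify a tune directory as already run, queued, or ready to submit."""
    if glob.glob(os.path.join(tune_dir, "slurm-*.out")):
        return "already run"                       # a Slurm log file exists
    if os.path.basename(tune_dir) in squeue_listing:
        return "queued"                            # job shows up in squeue
    return "would submit"
```

A directory is only submitted when neither a ''slurm-*.out'' file nor a matching queue entry is found.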
  
When ''tune_submit.py'' is called without arguments, it will just list the jobs that could be submitted:
<code>
$ ./tune_submit.py
          tune_5x5x5: Would submit, run with "doit!"
          tune_5x5x8: Would submit, run with "doit!"
</code>
  
Only when ''tune_submit.py'' is called with ''doit!'' as its first argument will it actually submit jobs:
<code>
$ ./tune_submit.py doit!
          tune_5x5x5: Submitting
Submitted batch job 277987
</code>
  
=== Step 6: Collect Results ===
Run ''tune_collect.py'' to parse all log files and determine the best kernel for each blocksize:
<code>
$ ./tune_collect.py
Reading: tune_5x5x5/tune_5x5x5_exe0.log
Reading: tune_5x5x8/tune_5x5x8_exe0.log
Kernel_dnt_tiny(m=8, n=8, k=5, split_thread=32, threads=96, grouping=16, minblocks=1) , # 62.8469 GFlops
Kernel_dnt_tiny(m=8, n=8, k=8, split_thread=32, threads=128, grouping=16, minblocks=1) , # 90.7763 GFlops

Wrote parameters.json
</code>

The file ''parameters.json'' now contains the newly autotuned parameters.

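The selection step can be illustrated with a minimal parser. This is a sketch: the winner-line format mirrors the log excerpt above, while the function name is hypothetical.

```python
# Sketch of picking the fastest kernel per (m, n, k); best_kernels is a
# hypothetical name, the line format mirrors the log excerpt above.
import re

def best_kernels(log_lines):
    """Return the fastest kernel line and its GFlops per (m, n, k)."""
    best = {}
    for line in log_lines:
        match = re.search(r"m=(\d+), n=(\d+), k=(\d+).*#\s*([\d.]+)\s*GFlops", line)
        if not match:
            continue
        mnk = tuple(int(x) for x in match.groups()[:3])
        gflops = float(match.group(4))
        if mnk not in best or gflops > best[mnk][1]:
            best[mnk] = (line.strip(), gflops)
    return best

lines = [
    "Kernel_dnt_tiny(m=8, n=8, k=5, threads=96, grouping=16, minblocks=1) , # 62.8469 GFlops",
    "Kernel_dnt_tiny(m=8, n=8, k=5, threads=64, grouping=16, minblocks=1) , # 55.1234 GFlops",
]
winners = best_kernels(lines)
```

For each blocksize only the highest-GFlops parameter set survives, which is what ends up in ''parameters.json''.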
=== Step 7: Merge new parameters with original parameter-file ===
Run ''tune_merge.py'' to merge the new parameters with the original ones:
<code>
$ ./tune_merge.py
Merging parameters.json with parameters_P100.json
Wrote parameters.new.json
</code>

The file ''parameters.new.json'' can now be used as a parameter file. Rename it to ''parameters_GPU.json'', with the appropriate ''GPU''.

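Conceptually, the merge keeps every original parameter set and overrides the block sizes that were just re-tuned. The sketch below uses a hypothetical record layout; the actual JSON structure handled by ''tune_merge.py'' may differ.

```python
# Sketch of the merge: parameter records keyed by (m, n, k), new ones win.
# The record layout here is hypothetical.

def merge_parameters(original, new):
    """Merge two lists of parameter records; entries from `new` override."""
    merged = {(p["m"], p["n"], p["k"]): p for p in original}
    merged.update({(p["m"], p["n"], p["k"]): p for p in new})
    return list(merged.values())

original = [{"m": 5, "n": 5, "k": 5, "threads": 64},
            {"m": 8, "n": 8, "k": 8, "threads": 96}]
new = [{"m": 5, "n": 5, "k": 5, "threads": 96}]
merged = merge_parameters(original, new)
```

Blocksizes that were not re-tuned keep their original entries, so the merged file is a strict superset of the autotuning run.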
=== Step 8: Contribute parameters to the community ===

**Contribute new optimal parameters**

Submit a pull request updating the appropriate ''parameters_GPU.json'' file to the [[https://github.com/cp2k/dbcsr|DBCSR repository]].

**Contribute autotuning data**

See [[https://github.com/cp2k/dbcsr-data#contributing|instructions]] in DBCSR's [[https://github.com/cp2k/dbcsr-data|data repository]].