====== Howto Optimize Cuda Kernels for Libcusmm ======
**Python version required:** python3.6

If you are about to autotune parameters for a new GPU (i.e. a GPU for which there are no autotuned parameters yet), please first follow [[https://github.com/cp2k/dbcsr/tree/develop/src/acc/libsmm_acc/libcusmm#adding-support-for-a-new-gpu-card|the instructions for a new GPU]].

=== Step 1: Go to the libcusmm directory ===
<code>
$ cd dbcsr/src/acc/libsmm_acc/libcusmm
</code>
  
=== Step 2: Adapt tune_setup.py to your environment ===
The ''tune_setup.py'' script generates job files. You have to adapt the script to the environment of your supercomputer and your personal settings.
<code python>
...
  def gen_jobfile(outdir, m, n, k):
      t = "/tune_%dx%dx%d" % (m, n, k)
      all_exe_src = [os.path.basename(fn) for fn in glob(outdir + t + "_*_main.cu")]
      all_exe = sorted([fn.replace("_main.cu", "") for fn in all_exe_src])

      output = "#!/bin/bash -l\n"
      output += "#SBATCH --nodes=%d\n" % len(all_exe)
      output += "#SBATCH --time=0:30:00\n"
      output += "#SBATCH --account=s238\n"
      output += "#SBATCH --partition=normal\n"
      output += "#SBATCH --constraint=gpu\n"
      output += "\n"
      output += "source ${MODULESHOME}/init/sh;\n"
      output += "module load daint-gpu\n"
      output += "module unload PrgEnv-cray\n"
      output += "module load PrgEnv-gnu/6.0.3\n"
      output += "module load cudatoolkit/8.0.54_2.2.8_ga620558-2.1\n"
      output += "module list\n"
      output += "export CRAY_CUDA_MPS=1\n"
      output += "cd $SLURM_SUBMIT_DIR \n"
      output += "\n"
      output += "date\n"
      for exe in all_exe:
          output += (
              "srun --nodes=1 --bcast=/tmp/${USER} --ntasks=1 --ntasks-per-node=1 --cpus-per-task=12 make -j 24 %s &\n"
              % exe
          )
   ...
...
</code>
  
=== Step 3: Run the script tune_setup.py ===
Specify which GPU you are autotuning for by passing the appropriate ''parameters_GPU.json'' file as an argument with ''-p''.
In addition, the script takes as arguments the blocksizes you want to add to libcusmm. For example, if the system you want to autotune for contains blocks of size 5 and 8, run:
<code>
$ ./tune_setup.py 5 8 -p parameters_P100.json
Found 23 parameter sets for 5x5x5
Found 31 parameter sets for 5x5x8
...
tune_8x8x8.job
</code>
For each possible parameter-set a //launcher// is generated. A launcher is a small snippet of C code that launches the kernel using the CUDA-specific ''%%<<< >>>%%'' notation. It also instantiates the C++ template which contains the actual kernel code.
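
To make this concrete, the following is a hedged sketch of how such a launcher could be emitted; the function name, the kernel name ''cusmm_dnt_tiny'', and the exact signature are assumptions for illustration, not the verbatim generated code.
<code python>
# Hypothetical sketch: emitting one launcher snippet from Python, in the
# style of tune_setup.py. Kernel name and signature are invented.
def gen_launcher(m, n, k, threads, grouping, minblocks):
    return """
int launch_tiny_%(m)dx%(n)dx%(k)d(int *param_stack, int stack_size,
                                  cudaStream_t stream, double *a_data,
                                  double *b_data, double *c_data) {
  // Instantiate the C++ kernel template for this parameter set and
  // launch it with the CUDA-specific <<< >>> notation.
  cusmm_dnt_tiny<%(m)d, %(n)d, %(k)d, %(threads)d, %(grouping)d, %(minblocks)d>
      <<<(stack_size + %(grouping)d - 1) / %(grouping)d, %(threads)d, 0, stream>>>(
          param_stack, stack_size, a_data, b_data, c_data);
  return 0;
}
""" % {"m": m, "n": n, "k": k,
       "threads": threads, "grouping": grouping, "minblocks": minblocks}
</code>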
  
In order to parallelize the benchmarking, the launchers are distributed over multiple executables.
Currently, up to 10'000 launchers are benchmarked by one //executable//. Each executable is linked together from several ''tune_*_part???.o'' files and a ''tune_*_main.o''. Each part-file contains up to 100 launchers. This allows the compilation to be parallelized over multiple CPU cores.
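
As a rough sketch of this grouping (illustrative only; the real logic lives in ''tune_setup.py''):
<code python>
# Distribute launchers over part files and executables, mirroring the
# numbers quoted above (up to 100 launchers per part file, up to 10'000
# launchers per executable).
def chunk(seq, size):
    """Split a sequence into consecutive chunks of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

launchers = ["launcher_%05d" % i for i in range(23456)]   # dummy parameter sets
executables = chunk(launchers, 10000)                     # -> 3 executables
parts = [chunk(exe, 100) for exe in executables]          # tune_*_part???.o units
# Each part can be compiled independently, so `make -j` spreads the
# compilation over multiple CPU cores.
</code>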
  
=== Step 4: Adapt tune_submit.py to your environment ===
The script ''tune_submit.py'' was written for the slurm batch system as used e.g. by CRAY supercomputers. If your computer runs a different batch system, you have to adapt ''tune_submit.py'' accordingly.
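
In practice, only the calls that touch the batch system need swapping. A minimal sketch, assuming a hypothetical helper structure (PBS command names shown as one possible target; ''tune_submit.py'' itself is organized differently):
<code python>
import subprocess

# Hypothetical sketch of the two batch-system-specific operations in a
# submit script; slurm commands here, PBS equivalents in comments.
def submit_job(jobfile):
    subprocess.check_call(["sbatch", jobfile])        # PBS: ["qsub", jobfile]

def queued_job_names():
    out = subprocess.check_output(["squeue", "--format=%j"],  # PBS: qstat
                                  universal_newlines=True)
    return set(out.splitlines()[1:])                  # skip the header line
</code>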
  
=== Step 5: Submit Jobs ===
Each tune-directory contains a job file.
Since there might be many tune-directories, the convenience script ''tune_submit.py'' can be used to submit jobs. It will go through all the ''tune_*''-directories and check if its job has already been submitted or run. For this, the script calls ''squeue'' in the background and searches for ''slurm-*.out'' files.
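
The check itself can be pictured as follows (a sketch of the idea, not the verbatim code from ''tune_submit.py''):
<code python>
import os
from glob import glob

def job_state(tune_dir, queued_job_names):
    """Classify one tune_* directory the way the check is described above:
    a slurm-*.out file means the job already ran, and a matching entry in
    the squeue output means it is queued or running."""
    if glob(os.path.join(tune_dir, "slurm-*.out")):
        return "already ran"
    if os.path.basename(tune_dir) in queued_job_names:
        return "submitted"
    return "would submit"
</code>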
  
When ''tune_submit.py'' is called without arguments, it will just list the jobs that could be submitted:
<code>
$ ./tune_submit.py
          tune_5x5x5: Would submit, run with "doit!"
          tune_5x5x8: Would submit, run with "doit!"
...
</code>
  
Only when ''tune_submit.py'' is called with ''doit!'' as its first argument will it actually submit jobs:
<code>
$ ./tune_submit.py doit!
          tune_5x5x5: Submitting
Submitted batch job 277987
...
</code>
  
=== Step 6: Collect Results ===
Run ''tune_collect.py'' to parse all log files and determine the best kernel for each blocksize:
<code>
$ ./tune_collect.py
Reading: tune_5x5x5/tune_5x5x5_exe0.log
Reading: tune_5x5x8/tune_5x5x8_exe0.log
...
Kernel_dnt_tiny(m=8, n=8, k=5, split_thread=32, threads=96, grouping=16, minblocks=1) , # 62.8469 GFlops
Kernel_dnt_tiny(m=8, n=8, k=8, split_thread=32, threads=128, grouping=16, minblocks=1) , # 90.7763 GFlops

Wrote parameters.json
</code>

The file ''parameters.json'' now contains the newly autotuned parameters.
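
Conceptually, the collection step just keeps the fastest measurement per blocksize. A sketch of the idea (the log format assumed below is simplified from the output shown above; the real parsing lives in ''tune_collect.py''):
<code python>
import re

# Keep the fastest kernel per (m, n, k) blocksize, assuming log lines that
# look like the Kernel_dnt_* records shown above.
best = {}  # (m, n, k) -> (gflops, full kernel record)

def record(line):
    match = re.search(r"Kernel_\w+\(m=(\d+), n=(\d+), k=(\d+).*# ([\d.]+) GFlops", line)
    if match:
        key = tuple(int(g) for g in match.groups()[:3])
        gflops = float(match.group(4))
        if key not in best or gflops > best[key][0]:
            best[key] = (gflops, line.strip())
</code>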

=== Step 7: Merge new parameters with original parameter-file ===
Run ''tune_merge.py'' to merge the new parameters with the original ones:
<code>
$ ./tune_merge.py
Merging parameters.json with parameters_P100.json
Wrote parameters.new.json
</code>

The file ''parameters.new.json'' can now be used as a parameter file. Rename it to ''parameters_GPU.json'', with the appropriate ''GPU''.
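
The merge itself amounts to letting the newly autotuned entries override the old ones per (m, n, k) triple. A sketch under the assumption that each file holds a JSON list of records with ''m'', ''n'', ''k'' keys (the actual schema may differ):
<code python>
import json

# Hedged sketch: merge old and new parameter records, new entries winning.
with open("parameters_P100.json") as f:
    old_params = json.load(f)
with open("parameters.json") as f:
    new_params = json.load(f)

merged = {(p["m"], p["n"], p["k"]): p for p in old_params}
merged.update({(p["m"], p["n"], p["k"]): p for p in new_params})

with open("parameters.new.json", "w") as f:
    json.dump(sorted(merged.values(), key=lambda p: (p["m"], p["n"], p["k"])),
              f, indent=2)
</code>
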
=== Step 8: Contribute parameters to the community ===

**Contribute new optimal parameters**

Submit a pull request updating the appropriate ''parameters_GPU.json'' file to the [[https://github.com/cp2k/dbcsr|DBCSR repository]].

**Contribute autotuning data**

See [[https://github.com/cp2k/dbcsr-data#contributing|instructions]] in DBCSR's [[https://github.com/cp2k/dbcsr-data|data repository]].