====== Howto Optimize Cuda Kernels for Libcusmm ======
**Python version required:** python3.6

If you are about to autotune parameters for a new GPU (i.e. a GPU for which there are no autotuned parameters yet), please first follow [[https://github.com/cp2k/dbcsr/tree/develop/src/acc/libsmm_acc/libcusmm#adding-support-for-a-new-gpu-card|the instructions for a new GPU]].

=== Step 1: Go to the libcusmm directory ===
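All tuning scripts live in the ''libcusmm'' directory of the DBCSR source tree. A minimal sketch, assuming you cloned DBCSR into ''dbcsr/'' (the path matches the repository link above):
<code>
$ cd dbcsr/src/acc/libsmm_acc/libcusmm
</code>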

=== Step 2: Adapt tune_setup.py to your environment ===
The script ''tune_setup.py'' generates job files for the batch system. You may have to adapt the SLURM parameters in the following section to your machine:
<code python>
...
# (Excerpt from tune_setup.py: "import os" and "from glob import glob"
#  appear earlier in the file.)
def gen_jobfile(outdir, m, n, k):
    t = "/tune_%dx%dx%d" % (m, n, k)
    all_exe_src = [os.path.basename(fn) for fn in glob(outdir + t + "_*_main.cu")]
    all_exe = sorted([fn.replace("_main.cu", "") for fn in all_exe_src])

    # SLURM header: one node per executable to benchmark
    output = "#!/bin/bash -l\n"
    output += "#SBATCH --nodes=%d\n" % len(all_exe)
    output += "#SBATCH --time=0:30:00\n"
    output += "#SBATCH --account=s238\n"
    output += "#SBATCH --partition=normal\n"
    output += "#SBATCH --constraint=gpu\n"
    output += "\n"
    # Environment setup: GNU programming environment and CUDA toolkit
    output += "source ${MODULESHOME}/init/sh;\n"
    output += "module load daint-gpu\n"
    output += "module unload PrgEnv-cray\n"
    output += "module load PrgEnv-gnu/6.0.3\n"
    output += "module load cudatoolkit/8.0.54_2.2.8_ga620558-2.1\n"
    output += "module list\n"
    output += "export CRAY_CUDA_MPS=1\n"
    output += "cd $SLURM_SUBMIT_DIR \n"
    output += "\n"
    output += "date\n"
    # Build each executable in its own job step, all in parallel
    for exe in all_exe:
        output += (
            "srun --nodes=1 --bcast=/tmp/${USER} --ntasks=1 --ntasks-per-node=1 --cpus-per-task=12 make -j 24 %s &\n"
            % exe
        )
    ...
...
</code>
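
For concreteness, this is the job script the function above would emit for a hypothetical case of two executables (the executable names are illustrative):
<code>
#!/bin/bash -l
#SBATCH --nodes=2
#SBATCH --time=0:30:00
#SBATCH --account=s238
#SBATCH --partition=normal
#SBATCH --constraint=gpu

source ${MODULESHOME}/init/sh;
module load daint-gpu
module unload PrgEnv-cray
module load PrgEnv-gnu/6.0.3
module load cudatoolkit/8.0.54_2.2.8_ga620558-2.1
module list
export CRAY_CUDA_MPS=1
cd $SLURM_SUBMIT_DIR

date
srun --nodes=1 --bcast=/tmp/${USER} --ntasks=1 --ntasks-per-node=1 --cpus-per-task=12 make -j 24 tune_8x8x8_exe0 &
srun --nodes=1 --bcast=/tmp/${USER} --ntasks=1 --ntasks-per-node=1 --cpus-per-task=12 make -j 24 tune_8x8x8_exe1 &
</code>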
  
=== Step 3: Run the script tune_setup.py ===
The script takes as arguments the blocksizes you want to add to libcusmm. For example, if the system you want to autotune for contains blocks of size 5 and 8, run:
<code>
$ ./tune_setup.py 5 8
...
</code>
  
In order to parallelize the benchmarking, the launchers are distributed over multiple executables.
Currently, up to 10'000 launchers are benchmarked by one //executable//. Each executable is linked together from several ''tune_*_part???.o'' files and a ''tune_*_main.o''. Each part-file contains up to 100 launchers. This allows the compilation to be parallelized over multiple CPU cores, as illustrated below.
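
As an illustration, the generated source files for one executable might look as follows (the directory name comes from the ''tune_%dx%dx%d'' naming scheme above; the ''exe0'' infix is an assumption for this sketch):
<code>
$ ls tune_8x8x8/
Makefile
tune_8x8x8_exe0_main.cu
tune_8x8x8_exe0_part000.cu
tune_8x8x8_exe0_part001.cu
...
</code>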
  
=== Step 4: Adapt tune_submit.py to your environment ===
The script ''tune_submit.py'' was written for the SLURM batch system as used, e.g., by Cray supercomputers. If your computer runs a different batch system, you have to adapt ''tune_submit.py'' accordingly, for example as sketched below.
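
For instance, where the SLURM workflow submits each generated job script with ''sbatch'', a PBS/Torque machine would use ''qsub'' instead (a hypothetical illustration; the job file path and the exact submit call inside ''tune_submit.py'' are assumptions):
<code>
# SLURM, as assumed by tune_submit.py:
$ sbatch tune_8x8x8/tune_8x8x8.job
# PBS/Torque equivalent:
$ qsub tune_8x8x8/tune_8x8x8.job
</code>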
  
=== Step 5: Submit Jobs ===
<code>
$ ./tune_submit.py
...
</code>
  
=== Step 6: Collect Results ===
Run ''tune_collect.py'' to parse all log files and determine the best kernel for each blocksize:
<code>
$ ./tune_collect.py
...
Kernel_dnt_tiny(m=8, n=8, k=5, split_thread=32, threads=96, grouping=16, minblocks=1) , # 62.8469 GFlops
Kernel_dnt_tiny(m=8, n=8, k=8, split_thread=32, threads=128, grouping=16, minblocks=1) , # 90.7763 GFlops

Wrote parameters.json
</code>

The file ''parameters.json'' now contains the newly autotuned parameters.
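
As a rough sketch of its content (field names inferred from the kernel descriptors printed above; the actual schema and the ''perf'' field are assumptions):
<code>
[
  {"m": 8, "n": 8, "k": 5, "algorithm": "tiny", "split_thread": 32,
   "threads": 96, "grouping": 16, "minblocks": 1, "perf": 62.8469},
  ...
]
</code>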

=== Step 7: Merge new parameters with the original parameter file ===
Run ''tune_merge.py'' to merge the new parameters with the original ones:
<code>
$ ./tune_merge.py
Merging parameters.json with parameters_P100.json
Wrote parameters.new.json
</code>

The file ''parameters.new.json'' can now be used as a parameter file. Rename it to ''parameters_GPU.json'', with the appropriate ''GPU'' name.
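
For instance, if the GPU you autotuned for were an NVIDIA V100 (a hypothetical name for this sketch):
<code>
$ mv parameters.new.json parameters_V100.json
</code>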

=== Step 8: Contribute parameters to the community ===

**Contribute new optimal parameters**

Submit a pull request updating the appropriate ''parameters_GPU.json'' file to the [[https://github.com/cp2k/dbcsr|DBCSR repository]].

**Contribute autotuning data**

See the [[https://github.com/cp2k/dbcsr-data#contributing|contribution instructions]] in DBCSR's [[https://github.com/cp2k/dbcsr-data|data repository]].