====== Howto Optimize CUDA Kernels for Libcusmm ======
**Python version required:** python3.6

If you are about to autotune parameters for a new GPU (i.e. a GPU for which there are no autotuned parameters yet), please first follow the instructions at [[https://...]].

=== Step 1: Go to the libcusmm directory ===
<code>
$ cd dbcsr/.../libcusmm
</code>

=== Step 2: Adapt tune_setup.py to your environment ===
The ''tune_setup.py'' script generates the benchmark sources and a batch job file for every blocksize combination. Adapt its ''gen_jobfile'' function (SLURM directives, modules to load, ''srun'' options) to your computing environment:
<code python>
...
def gen_jobfile(outdir, m, n, k):
    t = "/tune_%dx%dx%d" % (m, n, k)
    all_exe_src = [os.path.basename(fn) for fn in glob(outdir + t + "_*_main.cu")]
    all_exe = sorted([fn.replace("_main.cu", "") for fn in all_exe_src])

    # Batch job header: shebang and #SBATCH directives (nodes, time, partition, ...).
    # Adapt these to your machine.
    output = "#!/bin/bash -l\n"
    output += "#SBATCH ...\n"
    ...
    # Environment setup: load whatever modules are needed to build and run the benchmarks.
    output += "module load ...\n"
    ...
    output += "cd $SLURM_SUBMIT_DIR \n"
    ...
    # One srun call per executable, so the benchmarks of one blocksize run in parallel.
    for exe in all_exe:
        output += (
            "srun --nodes=1 --bcast=/... %s\n" %
            exe)
...
...
</code>

=== Step 3: Run the script tune_setup.py ===
Specify which GPU you are autotuning for by passing the corresponding parameter-file (e.g. ''parameters_P100.json'') to the script.
In addition, the script takes as arguments the blocksizes you want to add to libcusmm. For example, if the system you want to autotune for contains blocks of size 5 and 8, run:
<code>
$ ./tune_setup.py 5 8
Found 23 parameter sets for 5x5x5
Found 31 parameter sets for 5x5x8
Found 107 parameter sets for 5x8x5
Found 171 parameter sets for 5x8x8
Found 75 parameter sets for 8x5x5
Found 107 parameter sets for 8x5x8
Found 248 parameter sets for 8x8x5
Found 424 parameter sets for 8x8x8
</code>

The script will create a directory for each combination of the blocksizes:
<code>
$ ls -d tune_*
tune_5x5x5  tune_5x5x8  tune_5x8x5  tune_5x8x8  tune_8x5x5  tune_8x5x8  tune_8x8x5  tune_8x8x8
</code>
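
These are simply all ordered (m, n, k) triples that can be formed from the requested blocksizes; a quick sketch of that enumeration (illustrative only, not part of the tuning scripts):
<code python>
from itertools import product

blocksizes = [5, 8]
# every ordered (m, n, k) combination gets its own tune directory
for m, n, k in product(blocksizes, repeat=3):
    print("tune_%dx%dx%d" % (m, n, k))  # 2**3 = 8 directories for sizes 5 and 8
</code>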

Each directory contains a number of files:
<code>
$ ls -1 tune_8x8x8/
Makefile
tune_8x8x8_exe0_main.cu
tune_8x8x8_exe0_part0.cu
tune_8x8x8_exe0_part1.cu
tune_8x8x8_exe0_part2.cu
tune_8x8x8_exe0_part3.cu
tune_8x8x8_exe0_part4.cu
tune_8x8x8.job
</code>
For each possible parameter-set a //launcher// is generated: a small piece of code that launches the kernel with that particular set of parameters.

In order to parallelize the benchmarking, the launchers are distributed over several executables.
Currently, up to 10'000 launchers are benchmarked by one //exe//, which is in turn compiled from the several //part// files seen in the listing above.
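
The grouping of launchers into exe-files amounts to splitting one long list into fixed-size chunks; a minimal sketch of that idea (function and variable names here are illustrative, not taken from ''tune_setup.py''):
<code python>
def chunks(launchers, max_per_exe=10000):
    """Yield consecutive groups of launchers, one group per exe-file."""
    for i in range(0, len(launchers), max_per_exe):
        yield launchers[i:i + max_per_exe]

# e.g. the 424 parameter sets found for 8x8x8 fit into a single exe-file
groups = list(chunks(["launcher_%d" % i for i in range(424)]))
print(len(groups))  # 1
</code>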

=== Step 4: Adapt tune_submit.py to your environment ===
The ''tune_submit.py'' script submits the generated job files to the batch system. It was written for SLURM; if your machine uses a different batch system, adapt the submission command accordingly.

=== Step 5: Submit Jobs ===
Each tune-directory contains a job file.
Since there might be many tune-directories, the convenience script ''tune_submit.py'' is used to submit the jobs: it loops over all ''tune_*'' directories and submits the job file found in each of them.

When ''tune_submit.py'' is run without arguments, it only performs a dry run and reports which jobs it would submit:
<code>
$ ./tune_submit.py
tune_5x5x5: Would submit, run with "doit!"
tune_5x5x8: Would submit, run with "doit!"
tune_5x8x5: Would submit, run with "doit!"
tune_5x8x8: Would submit, run with "doit!"
tune_8x5x5: Would submit, run with "doit!"
tune_8x5x8: Would submit, run with "doit!"
tune_8x8x5: Would submit, run with "doit!"
tune_8x8x8: Would submit, run with "doit!"
Number of jobs submitted: 8
</code>

Only when ''tune_submit.py'' is run with ''doit!'' as an argument does it actually submit the jobs:
<code>
$ ./tune_submit.py doit!
tune_5x5x5: Submitting
Submitted batch job 277987
tune_5x5x8: Submitting
Submitted batch job 277988
tune_5x8x5: Submitting
Submitted batch job 277989
tune_5x8x8: Submitting
Submitted batch job 277990
tune_8x5x5: Submitting
Submitted batch job 277991
tune_8x5x8: Submitting
Submitted batch job 277992
tune_8x8x5: Submitting
Submitted batch job 277993
tune_8x8x8: Submitting
Submitted batch job 277994
Number of jobs submitted: 8
</code>
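
This is the familiar dry-run pattern: nothing is handed to the batch system unless the explicit go-ahead argument is given. A minimal sketch of such a guard (hypothetical code, not the actual ''tune_submit.py''):
<code python>
import sys
from glob import glob
from subprocess import call

submit = len(sys.argv) > 1 and sys.argv[1] == "doit!"

n_jobs = 0
for d in sorted(glob("tune_*")):
    if submit:
        print("%s: Submitting" % d)
        call(["sbatch", d + ".job"], cwd=d)  # sbatch prints "Submitted batch job <id>"
    else:
        print('%s: Would submit, run with "doit!"' % d)
    n_jobs += 1
print("Number of jobs submitted: %d" % n_jobs)
</code>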

=== Step 6: Collect Results ===
Once the jobs have finished, run ''tune_collect.py'' to parse all log files and determine the best-performing kernel variant for each blocksize:
<code>
$ ./tune_collect.py
Reading: tune_5x5x5/ ...
Reading: tune_5x5x8/ ...
Reading: tune_5x8x5/ ...
Reading: tune_5x8x8/ ...
Reading: tune_8x5x5/ ...
Reading: tune_8x5x8/ ...
Reading: tune_8x8x5/ ...
Reading: tune_8x8x8/ ...
Kernel_dnt_tiny(m=5, ...)
Kernel_dnt_tiny(m=5, ...)
Kernel_dnt_medium(m=5, ...)
Kernel_dnt_tiny(m=5, ...)
Kernel_dnt_medium(m=8, ...)
Kernel_dnt_medium(m=8, ...)
Kernel_dnt_tiny(m=8, ...)
Kernel_dnt_tiny(m=8, ...)

Wrote parameters.json
</code>

The file ''parameters.json'' now contains the autotuned optimal parameters for the requested blocksizes.
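
Each entry records a blocksize triple together with the winning kernel variant and its launch parameters, roughly along these lines (the field names and values below are illustrative assumptions, not output of ''tune_collect.py''):
<code python>
# one illustrative record (values are made up); the file holds one such record per blocksize
record = {
    "m": 8, "n": 8, "k": 8,   # blocksize triple
    "algorithm": "medium",    # winning kernel variant, e.g. "tiny" or "medium"
    "threads": 96,            # launch parameters of the winning parameter set
    "grouping": 16,
    "minblocks": 1,
    "perf": 180.0,            # measured performance of that kernel
}
</code>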

=== Step 7: Merge new parameters with original parameter-file ===
Run ''tune_merge.py'' to merge the newly autotuned parameters into the original parameter-file:
<code>
$ ./tune_merge.py
Merging parameters.json with parameters_P100.json
Wrote parameters.new.json
</code>

The file ''parameters.new.json'' can now be used as the parameter-file for libcusmm; it contains both the pre-existing and the newly autotuned parameters.
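
Conceptually, the merge keys every record by its (m, n, k) triple, keeps all existing entries and lets the newly autotuned ones take precedence; a minimal sketch of that idea (the file layout and field names are assumptions, this is not the actual ''tune_merge.py''):
<code python>
import json

def merge(old_records, new_records):
    """Index records by blocksize triple; newly autotuned entries override existing ones."""
    merged = {(r["m"], r["n"], r["k"]): r for r in old_records}
    merged.update({(r["m"], r["n"], r["k"]): r for r in new_records})
    return [merged[key] for key in sorted(merged)]

with open("parameters_P100.json") as f_old, open("parameters.json") as f_new:
    out = merge(json.load(f_old), json.load(f_new))

with open("parameters.new.json", "w") as f_out:
    json.dump(out, f_out, indent=2)
</code>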

=== Step 8: Contribute parameters to the community ===

**Contribute new optimal parameters**

Submit a pull request updating the appropriate ''parameters_GPU.json'' file in the DBCSR repository.

**Contribute autotuning data**

See [[https://...]].