howto:libcusmm
Differences
This shows you the differences between two versions of the page.
howto:libcusmm [2019/02/08 13:50] – describe arguments for tune_setup.py sjakobovits | howto:libcusmm [2019/04/09 12:45] (current) – removed alazzaro | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Howto Optimize Cuda Kernels for Libcusmm ====== | ||
- | **Python version required:** python3.6 | ||
- | |||
- | If you are about to autotune parameters for a new GPU (i.e. a GPU for which there are no autotuned parameters yet), please first follow [[https:// | ||
- | |||
- | === Step 1: Go to the libcusmm directory === | ||
- | < | ||
- | $ cd dbcsr/ | ||
- | </ | ||
- | |||
- | === Step 2: Adapt tune_setup.py to your environment === | ||
- | The '' | ||
- | <code python> | ||
- | ... | ||
- | def gen_jobfile(outdir, | ||
- | t = "/ | ||
- | all_exe_src = [os.path.basename(fn) for fn in glob(outdir + t + " | ||
- | all_exe = sorted([fn.replace(" | ||
- | |||
- | output = "# | ||
- | output += "# | ||
- | output += "# | ||
- | output += "# | ||
- | output += "# | ||
- | output += "# | ||
- | output += " | ||
- | output += " | ||
- | output += " | ||
- | output += " | ||
- | output += " | ||
- | output += " | ||
- | output += " | ||
- | output += " | ||
- | output += "cd $SLURM_SUBMIT_DIR \n" | ||
- | output += " | ||
- | output += " | ||
- | for exe in all_exe: | ||
- | output += ( | ||
- | "srun --nodes=1 --bcast=/ | ||
- | exe) | ||
- | ... | ||
- | ... | ||
- | </ | ||
- | |||
- | === Step 3: Run the script tune_setup.py === | ||
- | Specify which GPU you are autotuning for by passing the appropriate '' | ||
- | In addition, the script takes as arguments the blocksizes you want to add to libcusmm. For example, if the system you want to autotune for contains blocks of size 5 and 8, run: | ||
- | < | ||
- | $ ./ | ||
- | Found 23 parameter sets for 5x5x5 | ||
- | Found 31 parameter sets for 5x5x8 | ||
- | Found 107 parameter sets for 5x8x5 | ||
- | Found 171 parameter sets for 5x8x8 | ||
- | Found 75 parameter sets for 8x5x5 | ||
- | Found 107 parameter sets for 8x5x8 | ||
- | Found 248 parameter sets for 8x8x5 | ||
- | Found 424 parameter sets for 8x8x8 | ||
- | </ | ||
- | |||
- | The script will create a directory for each combination of the blocksizes: | ||
- | < | ||
- | $ ls -d tune_* | ||
- | tune_5x5x5 | ||
- | </ | ||
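The set of tune directories follows from taking every ordered (m, n, k) triple over the requested block sizes, which is why two block sizes yield eight parameter-set lines in the output above. A minimal sketch of that enumeration, assuming the `tune_MxNxK` naming seen in this howto; the helper name `tune_dirs` is hypothetical and is not taken from `tune_setup.py`:

```python
from itertools import product


def tune_dirs(blocksizes):
    """Return one 'tune_MxNxK' name per ordered (m, n, k) triple.

    Hypothetical helper for illustration only; the real script may
    build these names differently.
    """
    sizes = sorted(set(blocksizes))  # drop duplicates, keep a stable order
    return ["tune_%dx%dx%d" % (m, n, k) for m, n, k in product(sizes, repeat=3)]


if __name__ == "__main__":
    # Block sizes 5 and 8, as in the example of this step.
    for name in tune_dirs([5, 8]):
        print(name)
```

With block sizes 5 and 8 this lists the eight names from `tune_5x5x5` to `tune_8x8x8`; in general, b distinct block sizes give b³ directories, so the number of jobs grows quickly.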
- | |||
- | Each directory contains a number of files: | ||
- | < | ||
- | $ ls -1 tune_8x8x8/ | ||
- | Makefile | ||
- | tune_8x8x8_exe0_main.cu | ||
- | tune_8x8x8_exe0_part0.cu | ||
- | tune_8x8x8_exe0_part1.cu | ||
- | tune_8x8x8_exe0_part2.cu | ||
- | tune_8x8x8_exe0_part3.cu | ||
- | tune_8x8x8_exe0_part4.cu | ||
- | tune_8x8x8.job | ||
- | </ | ||
- | For each possible parameter-set a // | ||
- | |||
- | In order to parallelize the benchmarking, | ||
- | Currently, up to 10'000 launchers are benchmarked by one // | ||
- | |||
- | === Step 4: Adapt tune_submit.py to your environment === | ||
- | The script '' | ||
- | |||
- | === Step 5: Submit Jobs === | ||
- | Each tune-directory contains a job file. | ||
- | Since there might be many tune-directories, | ||
- | |||
- | When '' | ||
- | < | ||
- | $ ./ | ||
- | tune_5x5x5: Would submit, run with " | ||
- | tune_5x5x8: Would submit, run with " | ||
- | tune_5x8x5: Would submit, run with " | ||
- | tune_5x8x8: Would submit, run with " | ||
- | tune_8x5x5: Would submit, run with " | ||
- | tune_8x5x8: Would submit, run with " | ||
- | tune_8x8x5: Would submit, run with " | ||
- | tune_8x8x8: Would submit, run with " | ||
- | Number of jobs submitted: 8 | ||
- | </ | ||
- | |||
- | Only when '' | ||
- | < | ||
- | $ ./ | ||
- | tune_5x5x5: Submitting | ||
- | Submitted batch job 277987 | ||
- | tune_5x5x8: Submitting | ||
- | Submitted batch job 277988 | ||
- | tune_5x8x5: Submitting | ||
- | Submitted batch job 277989 | ||
- | tune_5x8x8: Submitting | ||
- | Submitted batch job 277990 | ||
- | tune_8x5x5: Submitting | ||
- | Submitted batch job 277991 | ||
- | tune_8x5x8: Submitting | ||
- | Submitted batch job 277992 | ||
- | tune_8x8x5: Submitting | ||
- | Submitted batch job 277993 | ||
- | tune_8x8x8: Submitting | ||
- | Submitted batch job 277994 | ||
- | Number of jobs submitted: 8 | ||
- | </ | ||
- | |||
- | === Step 6: Collect Results === | ||
- | Run '' | ||
- | < | ||
- | $ ./ | ||
- | Reading: tune_5x5x5/ | ||
- | Reading: tune_5x5x8/ | ||
- | Reading: tune_5x8x5/ | ||
- | Reading: tune_5x8x8/ | ||
- | Reading: tune_8x5x5/ | ||
- | Reading: tune_8x5x8/ | ||
- | Reading: tune_8x8x5/ | ||
- | Reading: tune_8x8x8/ | ||
- | Kernel_dnt_tiny(m=5, | ||
- | Kernel_dnt_tiny(m=5, | ||
- | Kernel_dnt_medium(m=5, | ||
- | Kernel_dnt_tiny(m=5, | ||
- | Kernel_dnt_medium(m=8, | ||
- | Kernel_dnt_medium(m=8, | ||
- | Kernel_dnt_tiny(m=8, | ||
- | Kernel_dnt_tiny(m=8, | ||
- | |||
- | Wrote parameters.json | ||
- | </ | ||
- | |||
- | The file '' | ||
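The collection step reports one kernel per (m, n, k) triple. One way to picture this is as a "keep the best candidate per triple" reduction. The sketch below is not the collection script itself: the record layout (dicts with `m`, `n`, `k`, `perf`), the rule that a higher `perf` wins, and the function name `best_per_triple` are all assumptions made for illustration.

```python
def best_per_triple(results):
    """Keep the highest-'perf' record for each (m, n, k) triple.

    `results` is an iterable of dicts with keys 'm', 'n', 'k' and 'perf'
    (assumed layout; higher 'perf' is assumed to be better).
    """
    best = {}
    for rec in results:
        key = (rec["m"], rec["n"], rec["k"])
        if key not in best or rec["perf"] > best[key]["perf"]:
            best[key] = rec
    # Sort by triple so the output order is deterministic.
    return [best[key] for key in sorted(best)]


if __name__ == "__main__":
    # Made-up benchmark records, for illustration only.
    sample = [
        {"m": 5, "n": 5, "k": 5, "algorithm": "tiny", "perf": 10.0},
        {"m": 5, "n": 5, "k": 5, "algorithm": "medium", "perf": 12.5},
        {"m": 8, "n": 8, "k": 8, "algorithm": "tiny", "perf": 40.0},
    ]
    for rec in best_per_triple(sample):
        print(rec)
```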
- | |||
- | === Step 7: Merge new parameters with original parameter-file === | ||
- | Run '' | ||
- | < | ||
- | $ ./ | ||
- | Merging parameters.json with parameters_P100.json | ||
- | Wrote parameters.new.json | ||
- | </ | ||
- | |||
- | The file '' | ||
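The merge in this step can be pictured as a keyed update of the existing parameter list. The sketch below is not the actual merge script. It assumes that each parameter file is a JSON list of objects with `m`, `n` and `k` fields, and that a newly tuned entry replaces an existing entry for the same triple. The function names and the entry layout are assumptions; only the three file names come from the output above.

```python
import json


def merge_parameters(existing, new):
    """Merge two lists of parameter dicts, keyed on (m, n, k).

    Entries in `new` override entries in `existing` that share the same
    triple (assumed policy); all other entries are kept unchanged.
    """
    merged = {(p["m"], p["n"], p["k"]): p for p in existing}
    for p in new:
        merged[(p["m"], p["n"], p["k"])] = p
    return [merged[key] for key in sorted(merged)]


def merge_files(existing_path, new_path, out_path):
    """Read both JSON files, merge them, and write the result."""
    with open(existing_path) as f:
        existing = json.load(f)
    with open(new_path) as f:
        new = json.load(f)
    with open(out_path, "w") as f:
        json.dump(merge_parameters(existing, new), f, indent=2)


if __name__ == "__main__":
    # File names as they appear in the output of this step.
    merge_files("parameters_P100.json", "parameters.json", "parameters.new.json")
```

Writing to a separate output file, as the howto's output suggests, leaves the original parameter file untouched until the merged result has been inspected.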
- | |||
- | === Step 8: Contribute parameters to the community === | ||
- | |||
- | **Contribute new optimal parameters** | ||
- | |||
- | Submit a pull request updating the appropriate '' | ||
- | |||
- | **Contribute autotuning data** | ||
- | |||
- | See [[https:// | ||
howto/libcusmm.1549633858.txt.gz · Last modified: 2020/08/21 10:15 (external edit)