dev:profiling
Differences
This shows you the differences between two versions of the page.
profiling [2013/07/02 12:17] – [Why profiling ?] : fix grammar 129.132.169.16 | dev:profiling [2014/02/08 21:54] – profiling renamed to dev:profiling oschuett | ||
---|---|---|---|
Line 66: | Line 66: | ||
- SUBROUTINE: name, usually easily found in the CP2K code by 'grep qs_ks_build_kohn_sham_matrix' | - SUBROUTINE: name, usually easily found in the CP2K code by 'grep qs_ks_build_kohn_sham_matrix' | ||
- CALLS: the number of calls to this timer | - CALLS: the number of calls to this timer | ||
- | - ASD: The ' | + | - ASD: The ' |
- SELF TIME: How much time is spent in this subroutine, or in non-timed subroutines called by this subroutine. AVERAGE and MAXIMUM correspond to this quantity compared between different MPI ranks, and can be used to locate load-imbalance or synchronization points. | - SELF TIME: How much time is spent in this subroutine, or in non-timed subroutines called by this subroutine. AVERAGE and MAXIMUM correspond to this quantity compared between different MPI ranks, and can be used to locate load-imbalance or synchronization points. | ||
- TOTAL TIME: How much time is spent in this subroutine, including time spent in timed subroutines. AVERAGE and MAXIMUM as defined above | - TOTAL TIME: How much time is spent in this subroutine, including time spent in timed subroutines. AVERAGE and MAXIMUM as defined above | ||
- | Note that, for the threaded code, only the master thread is instrumented. | + | By default, only routines contributing at least 2% of the total runtime are included in the timing report. |
+ | Note that, for the threaded code, only the master thread is instrumented. | ||
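The SELF TIME and TOTAL TIME entries above say that comparing AVERAGE and MAXIMUM across MPI ranks helps locate load imbalance. A minimal sketch of that comparison follows. The helper name `imbalance_pct` and the sample numbers are invented for illustration and are not part of CP2K; the two values are assumed to be copied by hand from one row of the timing report.

```shell
#!/bin/sh
# Hypothetical helper (not part of CP2K): given the AVERAGE and MAXIMUM
# time of one routine, print the share of the slowest rank's time that
# an average rank does not spend in that routine, in percent.
# A value near 0 means the ranks are well balanced.
# usage: imbalance_pct AVERAGE MAXIMUM
imbalance_pct() {
    awk -v avg="$1" -v max="$2" 'BEGIN {
        # Guard against division by zero for routines with no measured time.
        if (max <= 0) { print "0.0"; exit }
        printf "%.1f\n", 100 * (max - avg) / max
    }'
}

# Invented sample values: average 8.0 s, maximum 10.0 s.
imbalance_pct 8.0 10.0   # prints 20.0
```

A large percentage for a routine suggests that some ranks wait on others there, which is the load-imbalance or synchronization signal mentioned above.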
==== Modifying the timing report ==== | ==== Modifying the timing report ==== | ||
Line 199: | Line 200: | ||
[...] | [...] | ||
</ | </ | ||
+ | |||
+ | In principle, oprofile output can be converted to kcachegrind-readable files; working out how is still a TODO.
+ | |||
+ | ===== Valgrind ===== | ||
+ | |||
+ | In some cases, a very detailed callgraph and timing info is required, and it is better to employ the [[http:// | ||
+ | |||
+ | Basic profiling is easy: | ||
+ | < | ||
+ | valgrind --tool=callgrind ./cp2k.sopt -i test.inp -o test.out | ||
+ | </ | ||
+ | The result, a file named callgrind.out.XXX, | ||
+ | |||
+ | ===== nvprof ===== | ||
+ | |||
+ | Profiling the CUDA code can be done quite nicely using the nvprof tool. To do so, it is useful to enable user events, which requires compiling cp2k with <
+ | < | ||
+ | nvprof -o log.nvprof ./cp2k.sopt -i test.inp -o test.out | ||
+ | </ | ||
+ | and visualize log.nvprof with the nvvp tool, which might take several minutes to open the data. | ||
+ | |||
+ | An example profile for a linear scaling benchmark (TiO2) is shown here:
+ | {{ :: | ||
+ | |||
+ | To run in parallel on Cray architectures, the following additional tricks are needed:
+ | < | ||
+ | export PMI_NO_FORK=1 | ||
+ | # no cuda proxy | ||
+ | # export CRAY_CUDA_PROXY=1 | ||
+ | # use all cores with OMP | ||
+ | export OMP_NUM_THREADS=8 | ||
+ | # use aprun in MPMD mode to have only the output from the master rank (here 169 nodes are used) | ||
+ | COMMAND=" | ||
+ | PART1=" | ||
+ | PART2=" | ||
+ | aprun ${PART1} : ${PART2} | ||
+ | </ | ||
+ | |||
dev/profiling.txt · Last modified: 2020/08/21 10:15 by 127.0.0.1