====== Profiling ======
In order to understand and improve the run-time performance of CP2K, it is often necessary to collect detailed timing information. This provides answers to questions such as 'how long does this subroutine take?' or 'who calls this subroutine?'.
Probably the most important part of a profiling exercise is finding a reasonable test case. Does it run quickly enough to be practical, yet long enough to be accurate? Most importantly, is it representative of the calculations whose performance you actually want to improve?
===== The basics =====
- SUBROUTINE: name, usually easily found in the CP2K code by 'grep qs_ks_build_kohn_sham_matrix'
- CALLS: the number of calls to this timer
- ASD: the 'average stack depth', i.e. how deep in the call tree this routine is invoked on average
- SELF TIME: how much time is spent in this subroutine itself, or in non-timed subroutines called by this subroutine. AVERAGE and MAXIMUM compare this quantity between different MPI ranks, and can be used to locate load imbalance or synchronization points.
- TOTAL TIME: how much time is spent in this subroutine, including time spent in timed subroutines. AVERAGE and MAXIMUM as defined above.
By default, only routines contributing at least 2% of the total runtime are included in the timing report.
Note that, for the threaded code, only the master thread is instrumented.
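As a sketch of how such a report can be post-processed with standard shell tools, the snippet below ranks routines by SELF TIME. The report excerpt is hypothetical, but follows the column layout described above.

```shell
# Hypothetical excerpt of a CP2K timing report (column layout as above).
cat > timing_report.txt <<'EOF'
SUBROUTINE                    CALLS  ASD  SELF_TIME  TOTAL_TIME
qs_ks_build_kohn_sham_matrix     12  5.0     10.200      55.100
fft3d_ps                        480  9.3     22.500      22.500
EOF
# Skip the header, sort numerically on the SELF_TIME column, biggest first.
tail -n +2 timing_report.txt | sort -k4 -gr
```

In a real run, the timing table would of course be cut out of the CP2K output file first.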
==== Modifying the timing report ====
As the trace tends to be very verbose, further keywords (TRACE_MAX, TRACE_ROUTINES) allow for restricting the amount of information.
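As an illustration, a traced run could be configured with a &GLOBAL fragment along the following lines; the values shown are placeholders, so consult the CP2K input reference for the exact semantics of each keyword.

<code>
&GLOBAL
  ! enable per-call tracing (hypothetical example values below)
  TRACE .TRUE.
  ! stop tracing after this many trace lines
  TRACE_MAX 1000
  ! restrict the trace to the named routines
  TRACE_ROUTINES qs_ks_build_kohn_sham_matrix
&END GLOBAL
</code>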

===== Using oprofile for sampling based results =====

Detailed information, down to the level of individual source lines, can be obtained with oprofile, a sampling-based profiler available on most Linux systems.

First, this tool works best with good debug info, so compile your full code using '-O2 -g -fno-omit-frame-pointer'.

Next, perform a profiling run, removing any old data first:

<code>
rm -fR oprofile_data/
operf --callgraph ../
</code>

After execution, the data can be analyzed in various ways. Three options are particularly useful. The first two provide a general overview, listing the routines that consume most CPU time.

<code>
opreport --exclude-dependent --demangle=smart --symbols > report.symbols
opreport -cg > report.callgraph
</code>

The output of the first option is rather obvious:

<code>
CPU: Intel Sandy Bridge microarchitecture,
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples   %
78561     10.4904
72344
47158
41821
41761
37830
19554
18609
17633
16991
16982
15819
15678
15382
</code>
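The symbols report also lends itself to further scripting. As a sketch, the following sums the percentage column to see how much of the runtime the listed routines cover; the sample data and routine names are hypothetical.

```shell
# Hypothetical top of an opreport --symbols dump: samples, %, symbol name.
cat > report.symbols <<'EOF'
78561 10.4904 routine_a
72344  9.6602 routine_b
47158  6.2972 routine_c
EOF
# Sum the percentage column over the listed routines.
awk '{ s += $2 } END { printf "top routines cover %.1f%% of samples\n", s }' report.symbols
```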

For the second, the manual needs to be studied :-)

The third option delivers line-based profiling results, which can pinpoint individual statements that consume a lot of time:

<code>
opannotate --source --include-file dbcsr_mm_csr.F > operf.annotate
</code>

which results in a copy of the source code with lines annotated. However, keep in mind that the annotation is only as good as the quality of the debug info, and that compiler optimization might result in surprising timings.

<code>
[...]
   85  0.0065 :
 1728  0.1325 :
 1104  0.0846 :    k_size = k_sizes (a_col_l)
 3194  0.2448 :    mapped_k_size = k_size_maps(k_size)
             :
 5763  0.4418 :    b_blk_cycle:
             :
 8289  0.6354 :    IF (a_norm * b_norm .LT. a_row_eps) THEN
             :
 3073  0.2356 :
[...]
</code>

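A quick way to surface the hottest lines of such an annotated file is to sort on the sample count. The file contents below are a hypothetical stand-in for real opannotate output.

```shell
# Hypothetical fragment of an opannotate dump: samples, %, annotated source.
cat > operf.annotate <<'EOF'
   85  0.0065 :    x = 1
 3194  0.2448 :    mapped_k_size = k_size_maps(k_size)
 8289  0.6354 :    IF (a_norm * b_norm .LT. a_row_eps) THEN
EOF
# Numeric sort on the first field, most expensive lines first.
sort -nr operf.annotate | head -2
```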
In principle, oprofile output can be converted to kcachegrind-readable files; figuring this out is a TODO.

===== Valgrind =====

In some cases, a very detailed callgraph and timing info is required, and it is better to employ the callgrind tool of [[http://valgrind.org/|valgrind]].

Basic profiling is easy:
<code>
valgrind --tool=callgrind ./cp2k.sopt -i test.inp -o test.out
</code>
The result, a file named callgrind.out.XXX, can be visualized with kcachegrind.
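The callgrind output can also be inspected as plain text with callgrind_annotate, which ships with valgrind. The sketch below uses /bin/ls as a stand-in binary (an assumption; substitute the cp2k executable in practice) and falls back gracefully if valgrind is not installed.

```shell
# Sketch: text-mode inspection of callgrind data via callgrind_annotate.
# /bin/ls stands in for the cp2k binary here.
if command -v valgrind >/dev/null 2>&1; then
  valgrind --tool=callgrind --callgrind-out-file=cg.out /bin/ls >/dev/null 2>&1
  callgrind_annotate cg.out | head -5 > cg.txt
else
  echo "valgrind not installed" > cg.txt
fi
cat cg.txt
```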

===== nvprof =====

Profiling the CUDA code can be done quite nicely using the nvprof tool. To do so, it is useful to enable user events, which requires compiling cp2k with the appropriate profiling flag. Run
<code>
nvprof -o log.nvprof ./cp2k.sopt -i test.inp -o test.out
</code>
and visualize log.nvprof with the nvvp tool, which might take several minutes to open the data.

An example profile for a linear scaling benchmark (TiO2) is shown here.

To run on CRAY architectures in parallel, the following additional tricks are needed:
<code>
export PMI_NO_FORK=1
# no cuda proxy
# export CRAY_CUDA_MPS=1
# use all cores with OMP
export OMP_NUM_THREADS=8
# use aprun in MPMD mode to have only the output from the master rank (here 169 nodes are used)
COMMAND="…"
PART1="…"
PART2="…"
aprun ${PART1} : ${PART2}
</code>
dev/profiling.txt · Last modified: 2020/08/21 10:15 by 127.0.0.1