Differences

This shows you the differences between two versions of the page.

--- profiling [2013/07/02 07:30] – added section on oprofile vondele
+++ dev:profiling [2019/04/09 09:24] – [nvprof] alazzaro
@@ Line 4: / Line 4: @@
 In order to understand and improve the run-time performance of CP2K it is often needed to collect detailed timing information. This provides answers to questions such as 'how long does this subroutine take?' or 'who calls this subroutine'. This is most useful for developers, (non-expert) users rarely ever need this detailed info.
-Probably the most important part of a profiling exercise is taken a reasonable test case. Does it run quickly enough to be practical, long enough to be accurate ? Most importantly, does it exercise the interesting or relevant parts of the code. By selecting a suitable input, profiling can become much more precise, interesting, ...
+Probably the most important part of a profiling exercise is finding a reasonable test case. Does it run quickly enough to be practical, long enough to be accurate ? Most importantly, does it exercise the interesting or relevant parts of the code. By selecting a suitable input, profiling can become much more precise, interesting, ...
 ===== The basics =====
@@ Line 66: / Line 66: @@
   - SUBROUTINE: name, usually easily found in the CP2K code by 'grep qs_ks_build_kohn_sham_matrix' or similar. Sometimes a suffix is used to time sub-parts of a subroutine or to provide details.
   - CALLS: the number of calls to this timer
-  - ASD: The 'Average Stack Depth', i.e. the average number of entries on the call stack. In the example above, it can be used to guess that qs_energies_scf is called by qs_forces is called by CP2K. However, see also the CALLGRAPH section below.
+  - ASD: The 'Average Stack Depth', i.e. the average number of entries on the call stack. In the example above, it can be used to guess that qs_energies_scf is called by qs_forces is called by CP2K. However, see also the [[#the_cp2k_callgraph | CALLGRAPH section]] below.
   - SELF TIME: How much time is spent in this subroutine, or in non-timed subroutines called by this subroutine. AVERAGE and MAXIMUM correspond to this quantity compared between different MPI ranks, and can be used to locate load-imbalance or synchronization points.
   - TOTAL TIME: How much time is spent in this subroutine, including time spent in timed subroutines. AVERAGE and MAXIMUM as defined above
-Note that, for the threaded code, only the master thread is instrumented.
+By default, only routines contributing up to 2% of the total runtime are included in the timing report.  To see smaller routines, set a smaller cut-off with the [[http://manual.cp2k.org/trunk/CP2K_INPUT/GLOBAL/TIMINGS.html#desc_THRESHOLD|GLOBAL%TIMINGS%THRESHOLD]] keyword
+Note that, for the threaded code, only the master thread is instrumented.
 ==== Modifying the timing report ====
@@ Line 199: / Line 200: @@
 [...]
 </code>
+In principle oprofile output can be converted to kcachegrind readable files, figuring this out is a TODO.
+===== Valgrind =====
+In some cases, a very detailed callgraph and timing info is required, and it is better to employ the [[http://valgrind.org/|valgrind]] tool. Valgrind is essentially a simulator, which allows for obtaining exact counts in the callgraph, and details about cache misses, branch mispredictions etc. The disadvantage is two-fold. First, profiling takes a long time (50x slowdown under this simulator is common), and second, it is a simulated architecture which might be (slightly) different from a real CPU (in rare cases, instructions for very new CPUs are not supported. In that case, compile without '-march=native' or optimized (e.g. mkl) libraries).
+Basic profiling is easy:
+<code>
+valgrind --tool=callgrind ./cp2k.sopt -i test.inp -o test.out
+</code>
+The result, a file named callgrind.out.XXX, can be visualized with [[http://kcachegrind.sourceforge.net/html/Home.html|kcachegrind]]
+===== nvprof =====
+Profiling the CUDA code can be done quite nicely using the nvprof tool. To do so, it is useful to enable user events which requires compiling cp2k with <code> -D__CUDA_PROFILING </code> and linking against <code> -lnvToolsExt </code> library. For the serial code things are easy just run
+<code>
+nvprof -o log.nvprof ./cp2k.sopt -i test.inp -o test.out
+</code>
+and visualize log.nvprof with the nvvp tool, which might take several minutes to open the data.
+An example profile for a linear scaling benchmark (TiO2) is shown here
+{{ ::screenshot_nvvp_tio2.png?direct&800 | Sample profile from CP2K on TiO2}}
+To run on CRAY architectures in parallel the following additional tricks are needed
+<code>
+export PMI_NO_FORK=1
+# no cuda proxy
+# export CRAY_CUDA_MPS=1
+# use all cores with OMP
+export OMP_NUM_THREADS=8
+# use aprun in MPMD mode to have only the output from the master rank (here 169 nodes are used)
+COMMAND="./cp2k.psmp -i test.inp -o test.out-profile"
+PART1="-N 1  -n 1 -d ${OMP_NUM_THREADS} nvprof -o log.nvprof ${COMMAND}"
+PART2="-N 1  -n 168 -d ${OMP_NUM_THREADS} ${COMMAND}"
+aprun ${PART1} : ${PART2}
+</code>