Differences

This shows you the differences between two versions of the page.

--- profiling [2013/07/02 12:17] – [Why profiling ?] : fix grammar 129.132.169.16
+++ dev:profiling [2014/02/08 21:54] – profiling renamed to dev:profiling oschuett
@@ Line 66: / Line 66: @@
   - SUBROUTINE: name, usually easily found in the CP2K code by 'grep qs_ks_build_kohn_sham_matrix' or similar. Sometimes a suffix is used to time sub-parts of a subroutine or to provide details.
   - CALLS: the number of calls to this timer
-  - ASD: The 'Average Stack Depth', i.e. the average number of entries on the call stack. In the example above, it can be used to guess that qs_energies_scf is called by qs_forces is called by CP2K. However, see also the CALLGRAPH section below.
+  - ASD: The 'Average Stack Depth', i.e. the average number of entries on the call stack. In the example above, it can be used to guess that qs_energies_scf is called by qs_forces is called by CP2K. However, see also the [[#the_cp2k_callgraph | CALLGRAPH section]] below.
   - SELF TIME: How much time is spent in this subroutine, or in non-timed subroutines called by this subroutine. AVERAGE and MAXIMUM correspond to this quantity compared between different MPI ranks, and can be used to locate load-imbalance or synchronization points.
   - TOTAL TIME: How much time is spent in this subroutine, including time spent in timed subroutines. AVERAGE and MAXIMUM as defined above
-Note that, for the threaded code, only the master thread is instrumented.
+By default, only routines contributing up to 2% of the total runtime are included in the timing report.  To see smaller routines, set a smaller cut-off with the [[http://manual.cp2k.org/trunk/CP2K_INPUT/GLOBAL/TIMINGS.html#desc_THRESHOLD|GLOBAL%TIMINGS%THRESHOLD]] keyword
+Note that, for the threaded code, only the master thread is instrumented.
 ==== Modifying the timing report ====
@@ Line 199: / Line 200: @@
 [...]
 </code>
+In principle oprofile output can be converted to kcachegrind readable files, figuring this out is a TODO.
+===== Valgrind =====
+In some cases, a very detailed callgraph and timing info is required, and it is better to employ the [[http://valgrind.org/|valgrind]] tool. Valgrind is essentially a simulator, which allows for obtaining exact counts in the callgraph, and details about cache misses, branch mispredictions etc. The disadvantage is two-fold. First, profiling takes a long time (50x slowdown under this simulator is common), and second, it is a simulated architecture which might be (slightly) different from a real CPU (in rare cases, instructions for very new CPUs are not supported. In that case, compile without '-march=native' or optimized (e.g. mkl) libraries).
+Basic profiling is easy:
+<code>
+valgrind --tool=callgrind ./cp2k.sopt -i test.inp -o test.out
+</code>
+The result, a file named callgrind.out.XXX, can be visualized with [[http://kcachegrind.sourceforge.net/html/Home.html|kcachegrind]]
+===== nvprof =====
+Profiling the CUDA code can be done quite nicely using the nvprof tool. To do so, it is useful to enable user events which requires compiling cp2k with <code> -D__CUDA_PROFILING </code> and linking against <code> -lnvToolsExt </code> library. For the serial code things are easy just run
+<code>
+nvprof -o log.nvprof ./cp2k.sopt -i test.inp -o test.out
+</code>
+and visualize log.nvprof with the nvvp tool, which might take several minutes to open the data.
+An example profile for a linear scaling benchmark (TiO2) is shown here
+{{ ::screenshot_nvvp_tio2.png?direct&800 | Sample profile from CP2K on TiO2}}
+To run on CRAY architectures in parallel the following additional tricks are needed
+<code>
+export PMI_NO_FORK=1
+# no cuda proxy
+# export CRAY_CUDA_PROXY=1
+# use all cores with OMP
+export OMP_NUM_THREADS=8
+# use aprun in MPMD mode to have only the output from the master rank (here 169 nodes are used)
+COMMAND="./cp2k.psmp -i test.inp -o test.out-profile"
+PART1="-N 1  -n 1 -d ${OMP_NUM_THREADS} nvprof -o log.nvprof ${COMMAND}"
+PART2="-N 1  -n 168 -d ${OMP_NUM_THREADS} ${COMMAND}"
+aprun ${PART1} : ${PART2}
+</code>