  - TOTAL TIME: How much time is spent in this subroutine, including time spent in timed subroutines. AVERAGE and MAXIMUM as defined above
  
By default, only routines contributing at least 2% of the total runtime are included in the timing report. To also see smaller routines, set a smaller cut-off with the [[http://manual.cp2k.org/trunk/CP2K_INPUT/GLOBAL/TIMINGS.html#desc_THRESHOLD|GLOBAL%TIMINGS%THRESHOLD]] keyword.
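The keyword takes the cut-off as a fraction of the total runtime. A minimal sketch of the corresponding input section (the 0.1% value here is just an example, not a recommendation):
<code>
&GLOBAL
  &TIMINGS
    ! include routines contributing at least 0.1% of the total runtime
    THRESHOLD 1.0E-03
  &END TIMINGS
&END GLOBAL
</code>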
  
Note that, for the threaded code, only the master thread is instrumented.
==== Modifying the timing report ====
  
[...]
</code>

In principle, oprofile output can be converted to kcachegrind-readable files; figuring this out is a TODO.

===== Valgrind =====

In some cases a very detailed callgraph and timing info is required, and it is better to employ the [[http://valgrind.org/|valgrind]] tool. Valgrind is essentially a simulator, which allows for obtaining exact counts in the callgraph, as well as details about cache misses, branch mispredictions, etc. The disadvantage is twofold: first, profiling takes a long time (a 50x slowdown under the simulator is common), and second, the simulated architecture might differ (slightly) from a real CPU. In rare cases, instructions for very new CPUs are not supported; if so, compile without ''-march=native'' and without optimized (e.g. MKL) libraries.

Basic profiling is easy:
<code>
valgrind --tool=callgrind ./cp2k.sopt -i test.inp -o test.out
</code>
The result, a file named callgrind.out.XXX, can be visualized with [[http://kcachegrind.sourceforge.net/html/Home.html|kcachegrind]].
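When more than plain timings are needed, callgrind can additionally simulate caches and branch prediction. The flags below are standard callgrind options; the input and output file names are, as above, only examples:
<code>
# also collect cache-miss and branch-misprediction counts
valgrind --tool=callgrind --cache-sim=yes --branch-sim=yes \
         --dump-instr=yes ./cp2k.sopt -i test.inp -o test.out

# quick text summary without kcachegrind (replace <pid> with the actual id)
callgrind_annotate callgrind.out.<pid>
</code>
With ''--dump-instr=yes'', kcachegrind can additionally attribute costs to individual instructions.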

===== nvprof =====

Profiling the CUDA code can be done quite nicely using the nvprof tool. To do so, it is useful to enable user events, which requires compiling CP2K with ''-D__CUDA_PROFILING'' and linking against the ''-lnvToolsExt'' library. For the serial code things are easy, just run:
<code>
nvprof -o log.nvprof ./cp2k.sopt -i test.inp -o test.out
</code>
and visualize log.nvprof with the nvvp tool, which might take several minutes to open the data.
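When the full timeline view is not needed, nvprof can also print its results directly to the terminal. These are standard nvprof options; the file names are again only examples:
<code>
# aggregate per-kernel and per-memcpy summary, printed at program exit
nvprof ./cp2k.sopt -i test.inp -o test.out

# chronological trace of every kernel launch and memory copy
nvprof --print-gpu-trace ./cp2k.sopt -i test.inp -o test.out
</code>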

An example profile for a linear-scaling benchmark (TiO2) is shown here:
{{ ::screenshot_nvvp_tio2.png?direct&800 | Sample profile from CP2K on TiO2}}

To run in parallel on Cray architectures, the following additional tricks are needed:
<code>
export PMI_NO_FORK=1
# no cuda proxy
# export CRAY_CUDA_MPS=1
# use all cores with OMP
export OMP_NUM_THREADS=8
# use aprun in MPMD mode to have only the output from the master rank (here 169 nodes are used)
COMMAND="./cp2k.psmp -i test.inp -o test.out-profile"
PART1="-N 1  -n 1 -d ${OMP_NUM_THREADS} nvprof -o log.nvprof ${COMMAND}"
PART2="-N 1  -n 168 -d ${OMP_NUM_THREADS} ${COMMAND}"
aprun ${PART1} : ${PART2}
</code>
  
dev/profiling.txt · Last modified: 2020/08/21 10:15 by 127.0.0.1