dev:profiling

Differences

This shows you the differences between two versions of the page.

profiling [2013/07/03 06:52] – add section on valgrind 129.132.169.16
dev:profiling [2020/08/21 10:15] (current) – external edit 127.0.0.1
Line 70: Line 70:
   - TOTAL TIME: How much time is spent in this subroutine, including time spent in timed subroutines. AVERAGE and MAXIMUM as defined above
  
-Note that, for the threaded code, only the master thread is instrumented.
+By default, only routines contributing at least 2% of the total runtime are included in the timing report. To see smaller routines, set a smaller cut-off with the [[http://manual.cp2k.org/trunk/CP2K_INPUT/GLOBAL/TIMINGS.html#desc_THRESHOLD|GLOBAL%TIMINGS%THRESHOLD]] keyword.
  
 +Note that, for the threaded code, only the master thread is instrumented.
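
The cut-off described above could be lowered with an input fragment like the one written by this sketch. The file name ''timings_fragment.inp'' is hypothetical, and the value assumes that THRESHOLD is given as a fraction of the total runtime (so that the 2% default corresponds to 0.02); check the linked manual page before relying on it.

```shell
# Write a CP2K input fragment that lowers the timing-report cut-off.
# Assumption: THRESHOLD is a fraction of total runtime (0.001 = 0.1%).
# timings_fragment.inp is a hypothetical file name used for illustration.
cat > timings_fragment.inp <<'EOF'
&GLOBAL
  &TIMINGS
    THRESHOLD 0.001
  &END TIMINGS
&END GLOBAL
EOF
```

The TIMINGS subsection belongs inside the &GLOBAL section of the actual input file, next to the keywords already set there.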
 ==== Modifying the timing report ====
  
Line 208: Line 209:
 Basic profiling is easy:
 <code>
-valgrind --tool=callgraph ./cp2k.sopt -i test.inp -o test.out
+valgrind --tool=callgrind ./cp2k.sopt -i test.inp -o test.out
 </code>
 The result, a file named callgrind.out.XXX, can be visualized with [[http://kcachegrind.sourceforge.net/html/Home.html|kcachegrind]].
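
Where no graphical viewer is available, the callgrind output can also be summarized as text with ''callgrind_annotate'', which ships with valgrind. The following is a minimal sketch: the helper function ''latest_callgrind_file'' is a hypothetical name, and the 40-line cut is an arbitrary choice.

```shell
# Print the newest callgrind.out.* file in a directory (default: current one).
# latest_callgrind_file is a hypothetical helper, not part of valgrind.
latest_callgrind_file() {
  ls -t "${1:-.}"/callgrind.out.* 2>/dev/null | head -n 1 || true
}

profile=$(latest_callgrind_file .)
if [ -n "$profile" ] && command -v callgrind_annotate >/dev/null 2>&1; then
  # Text summary; show only the first 40 lines of the report
  callgrind_annotate "$profile" | head -n 40
else
  echo "no callgrind output found, or callgrind_annotate is not installed"
fi
```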
  
 +===== nvprof =====
 +
 +Profiling the CUDA code can be done quite nicely using the nvprof tool. To do so, it is useful to enable user events, which requires compiling cp2k with <code> -D__CUDA_PROFILING </code> and linking against the <code> -lnvToolsExt </code> library. For the serial code, things are easy: just run
 +<code>
 +nvprof -o log.nvprof ./cp2k.sopt -i test.inp -o test.out
 +</code>
 +and visualize log.nvprof with the nvvp tool, which might take several minutes to open the data. 
 +
 +An example profile for a linear scaling benchmark (TiO2) is shown here:
 +{{ ::screenshot_nvvp_tio2.png?direct&800 | Sample profile from CP2K on TiO2}}
 +
 +To run in parallel on CRAY architectures, the following additional tricks are needed:
 +<code>
 +export PMI_NO_FORK=1
 +# no cuda proxy
 +# export CRAY_CUDA_MPS=1
 +# use all cores with OMP
 +export OMP_NUM_THREADS=8
 +# use aprun in MPMD mode to have only the output from the master rank (here 169 nodes are used)
 +COMMAND="./cp2k.psmp -i test.inp -o test.out-profile"
 +PART1="-N 1  -n 1 -d ${OMP_NUM_THREADS} nvprof -o log.nvprof ${COMMAND}"
 +PART2="-N 1  -n 168 -d ${OMP_NUM_THREADS} ${COMMAND}"
 +aprun ${PART1} : ${PART2}
 +</code>
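
The rank count of the second MPMD part in the example above is simply the total number of ranks minus the one profiled rank. A minimal sketch that derives it instead of hard-coding 168 might look like this. ''TOTAL_RANKS'' is a hypothetical variable, and the command line is only echoed as a dry run rather than executed.

```shell
# TOTAL_RANKS is a hypothetical helper variable; with -N 1 (one rank per node)
# it equals the number of nodes, 169 in the example above.
TOTAL_RANKS=169
OMP_NUM_THREADS=8
COMMAND="./cp2k.psmp -i test.inp -o test.out-profile"
# One rank runs under nvprof, the remaining ranks run unprofiled
PART1="-N 1 -n 1 -d ${OMP_NUM_THREADS} nvprof -o log.nvprof ${COMMAND}"
PART2="-N 1 -n $((TOTAL_RANKS - 1)) -d ${OMP_NUM_THREADS} ${COMMAND}"
# Dry run: print the aprun command line instead of launching it
echo "aprun ${PART1} : ${PART2}"
```

Replacing the final ''echo'' with the actual ''aprun'' call gives the same launch line as in the example, with the rank count following from a single variable.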
  
  
dev/profiling.1372834353.txt.gz · Last modified: 2020/08/21 10:14 (external edit)