Configure TAU with one of the following, for CUDA or OpenCL support respectively:
./configure -cuda=<path to cuda toolkit> -bfd=download
./configure -opencl=<opencl headers/libraries> -bfd=download
(along with any other options you would normally give to TAU.)
Add <arch>/bin to your PATH and add <arch>/lib to your LD_LIBRARY_PATH.
To collect performance data, run your application with tau_exec, passing either '-cupti' (for CUDA applications) or '-opencl' (for OpenCL applications):
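As a minimal sketch of the path setup, assuming TAU was installed under $HOME/tau and built for the x86_64 architecture (both TAU_ROOT and TAU_ARCH are hypothetical names used only for illustration; substitute your own install location and <arch> directory):

```shell
# TAU_ROOT and TAU_ARCH are illustrative placeholders, not variables TAU itself requires.
TAU_ROOT="${TAU_ROOT:-$HOME/tau}"
TAU_ARCH="${TAU_ARCH:-x86_64}"

# Put the TAU tools (tau_exec and friends) on the search path.
export PATH="$TAU_ROOT/$TAU_ARCH/bin:$PATH"

# Prepend the TAU libraries; the ${VAR:+...} form avoids a trailing ':'
# when LD_LIBRARY_PATH was previously unset or empty.
export LD_LIBRARY_PATH="$TAU_ROOT/$TAU_ARCH/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Adding these lines to your shell startup file keeps the setting across sessions.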
tau_exec -T serial,cupti <-cupti|-opencl> ./a.out
MPI applications can be run like this:
mpirun -np 4 tau_exec -T mpi,cupti <-cupti|-opencl> ./a.out
(For CUDA version < 4.1 use -cuda instead of -cupti.)
To collect traces, type:
export TAU_TRACE=1
before the tau_exec command.
Then post-process the trace files by running:
tau_multimerge
tau2slog2 tau.trc tau.edf -o tau.slog2
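The tracing steps can be chained in a small helper. This is only a sketch for a serial CUDA application: collect_trace is a hypothetical function name, tracing is assumed to be enabled through TAU's TAU_TRACE environment variable, and the TAU tools are assumed to be on your PATH.

```shell
# collect_trace: run a command under tau_exec with tracing on, then merge
# and convert the trace. Arguments: the application and its arguments.
collect_trace() {
    # TAU_TRACE=1 is set only for the tau_exec invocation.
    TAU_TRACE=1 tau_exec -T serial,cupti -cupti "$@" &&
        tau_multimerge &&
        tau2slog2 tau.trc tau.edf -o tau.slog2
}
```

For example, `collect_trace ./a.out` would leave tau.slog2 in the current directory if each step succeeds. For OpenCL applications, replace -cupti with -opencl.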
To view profiles type:
paraprof
To view slog2 traces type:
jumpshot tau.slog2
The CUPTI counters available on a given machine can be listed by typing:
tau_cupti_avail
Set the counters you wish to collect by exporting them as a colon-separated list in the TAU_METRICS variable, e.g.:
export TAU_METRICS=<counter 1>:<counter 2>
Then run the application with tau_exec.
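When the counter list is long, it can be easier to build the colon-separated value from individual names. A minimal sketch follows; join_metrics is a hypothetical helper, and the counter names are placeholders rather than real CUPTI counters (use the names reported for your machine):

```shell
# join_metrics: print its arguments joined by ':'.
# The body runs in a subshell, so the IFS change does not leak out.
join_metrics() (
    IFS=':'
    printf '%s\n' "$*"
)

# Placeholder counter names, for illustration only.
TAU_METRICS="$(join_metrics placeholder_counter_1 placeholder_counter_2 placeholder_counter_3)"
export TAU_METRICS
```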
PGI OpenACC compiler
PGI uses the CUDA driver API to generate code for its accelerated regions, so you need to set:
export TAU_CUPTI_API=driver
before running a PGI OpenACC application.