# Guide:TAUCrayOpenAcc

## Jacobi example

Let's look at a simple Jacobi example written in Cray OpenACC:

```
!**********************************************************************
!     matmult.f90 - simple matrix multiply implementation
!************************************************************************
subroutine initialize(a, b, n)
integer n, i, j
real a(n,n)
real b(n,n)
! first initialize the A matrix
do i = 1,n
do j = 1,n
a(j,i) = i
end do
end do
! then initialize the B matrix
do i = 1,n
do j = 1,n
b(j,i) = i
end do
end do
end subroutine initialize
subroutine multiply_matrices(a, b, c, matsize)
IMPLICIT NONE
! matsize must be declared before it is used as an array bound
integer i, j, k, l, m, matsize
real a(matsize, matsize)
real b(matsize, matsize)
real c(matsize, matsize)
real ctemp
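!$acc data copyin(a,b) copyout(c)
!$acc kernels loop
do k = 1,matsize
do i = 1,matsize
! accumulate in a scalar: c is copyout, so it holds no valid data on the device
ctemp = 0.0
do j = 1,matsize
ctemp = ctemp + a(i,j) * b(j,k)
enddo
c(i,k) = ctemp
enddo
enddo
!$acc end kernels loop
!$acc end data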
end subroutine multiply_matrices
program main
integer SIZE_OF_MATRIX
parameter (SIZE_OF_MATRIX = 1000)
real a(SIZE_OF_MATRIX,SIZE_OF_MATRIX)
real b(SIZE_OF_MATRIX,SIZE_OF_MATRIX)
real c(SIZE_OF_MATRIX,SIZE_OF_MATRIX)
integer matsize
matsize = SIZE_OF_MATRIX
call initialize(a, b, matsize)
! multiply the matrices here using C(i,j) += (A(i,k)* B(k,j))
call multiply_matrices(a, b, c, matsize)
end program main
```

We will start with a simple OpenACC parallel loop directive right before the Jacobi computation. Here is the TAU profile:

We have profiles for the Jacobi kernel ("jacobi_\$ck_L215_2"), memory copies, and CPU synchronization. Look at the time spent copying data to the GPU: it completely dominates the runtime. Let's look at some details:

Nearly 26,000 memory copies for a total of 99 GB. That is a lot of memory being moved. As an improvement, let's try to keep as much data on the GPU as possible.
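One common way to keep data resident is to wrap the whole iteration loop in an OpenACC data region, so the arrays cross the bus once instead of on every sweep. The following is only a minimal sketch under assumptions, not the code that was profiled above: the program name, the array names `u` and `unew`, the grid size, and the iteration count are all hypothetical.

```fortran
! Hypothetical Jacobi sketch: names, sizes and iteration count are
! illustrative assumptions, not taken from the profiled application.
program jacobi_data_region
  implicit none
  integer, parameter :: n = 512, max_iter = 100
  real, allocatable :: u(:,:), unew(:,:)
  integer :: i, j, iter

  allocate(u(n,n), unew(n,n))
  u = 0.0
  u(1,:) = 1.0          ! fixed boundary value along one edge
  unew = u

  ! u is copied to the device once and back once; unew is copied in
  ! once so that its boundary cells are valid on the device.
  !$acc data copy(u) copyin(unew)
  do iter = 1, max_iter
     ! Both loops work on device-resident arrays, so no host/device
     ! transfer is needed inside the iteration loop.
     !$acc parallel loop collapse(2)
     do j = 2, n-1
        do i = 2, n-1
           unew(i,j) = 0.25 * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1))
        end do
     end do
     !$acc parallel loop collapse(2)
     do j = 2, n-1
        do i = 2, n-1
           u(i,j) = unew(i,j)
        end do
     end do
  end do
  !$acc end data

  print *, 'u(2,2) =', u(2,2)
end program jacobi_data_region
```

Without the enclosing data region, each `parallel loop` would typically move its arrays to and from the device on every iteration, which is the pattern that shows up as thousands of memory copies in a profile.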

Next, we initialize the matrices on the GPU rather than on the host. This is the profile we see:

Much better performance: memory copies to the GPU are now a quarter of what they were. The second kernel ("jacobi_\$ck_L281_6") is the final reduction. And the number of bytes copied:

Only 25 GB in about 11,500 copies.
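Initializing on the GPU, as described above, could look roughly like the sketch below. It is an illustration under assumptions rather than the profiled source: the program name, the array `u`, and its size are hypothetical. Because the array is first written on the device, a `copyout` clause is enough, and no initial host-to-device copy is needed.

```fortran
! Hypothetical sketch: array name and size are illustrative assumptions.
program init_on_device
  implicit none
  integer, parameter :: n = 512
  real, allocatable :: u(:,:)
  integer :: i, j

  allocate(u(n,n))

  ! copyout allocates u on the device without an initial host-to-device
  ! copy and transfers the result back when the region ends.
  !$acc data copyout(u)
  !$acc parallel loop collapse(2)
  do j = 1, n
     do i = 1, n
        if (i == 1) then
           u(i,j) = 1.0   ! boundary value
        else
           u(i,j) = 0.0   ! interior starts at zero
        end if
     end do
  end do
  ! ... Jacobi sweeps would go here, still inside the data region ...
  !$acc end data

  print *, 'sum(u) =', sum(u)
end program init_on_device
```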

## Configuring

Here is how to configure and use TAU to collect Cray OpenACC performance data:

```
./configure -arch=craycnl -cuda=/opt/nvidia/cudatoolkit/4.1.28 -cudalibrary=-L/opt/nvidia/cudatoolkit/4.1.28/lib64\ -L/opt/nvidia/cudatoolkit/4.1.28/extras/CUPTI/lib64\ -lcupti\ -L/opt/cray/nvidia/default/lib64\ -lcuda -bfd=none -mpi -useropt=-DTAU_MPICH3
```

And run this way:

```
export TAU_CUPTI_API=driver
aprun -n 8 tau_exec -T mpi,cray,cupti -cupti ./himeno
```