Difference between revisions of "GTC SC11"

Revision as of 01:47, 29 November 2011

# Define the following to 1 to enable build

BENCH_GTC_MPI = 1
BENCH_CHARGEI_PTHREADS = 0
BENCH_PUSHI_PTHREADS = 0
BENCH_SERIAL = 0
SDK_HOME = /nics/c/home/biersdor/NVIDIA_GPU_Computing_SDK/
CUDA_HOME = /sw/keeneland/cuda/4.0/linux_binary
NVCC_HOME = $(CUDA_HOME)

TAU_MAKEFILE=/nics/c/home/biersdor/tau2/x86_64/lib/Makefile.tau-cupti-mpi-pdt-openmp-opari
TAU_OPTIONS='-optPdtCOpts=-DPDT_PARSE -optVerbose -optShared -optTauSelectFile=select.tau'
TAU_FLAGS=-tau_makefile=$(TAU_MAKEFILE) -tau_options=$(TAU_OPTIONS)
CC = tau_cc.sh  $(TAU_FLAGS)
MPICC = tau_cc.sh $(TAU_FLAGS)
NVCC = nvcc

NVCC_FLAGS = -gencode=arch=compute_20,code=\"sm_20,compute_20\" -m64 --compiler-options '-finstrument-functions -fno-strict-aliasing' -I$(NVCC_HOME)/include -I. -DUNIX -O3 -DGPU_ACCEL=1 -I$(SDK_HOME)/C/common/inc -I$(SDK_HOME)/shared/inc

NVCC_LINK_FLAGS = -fPIC -m64 -L$(NVCC_HOME)/lib64 -L$(SDK_HOME)/shared/lib -L$(SDK_HOME)/C/lib -L$(SDK_HOME)/C/common/lib/linux -lcudart -lstdc++

CFLAGS = -DUSE_MPI=1 -DGPU_ACCEL=1
CFLAGSOMP = -fopenmp
COPTFLAGS = -std=c99
#CFLAGSOMP = -mp=bind
#COPTFLAGS = -fast  
CDEPFLAGS =  -MD
CLDFLAGS = -limf $(NVCC_LINK_FLAGS)
MPIDIR =  
CFLAGS  +=  -I$(CUDA_HOME)/include/
                  
EXEEXT = _keeneland_opt_gnu_tau_pdt
AR = ar      
ARCRFLAGS = cr
RANLIB = ranlib
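
The settings above route every C and MPI compile through the TAU compiler wrapper. For comparing against an uninstrumented build, one option is to make the wrapper switchable. The fragment below is only a sketch under stated assumptions: the `USE_TAU` variable is hypothetical and not part of the original configuration, and the fallback compiler names `gcc` and `mpicc` are guesses about the local toolchain. It is meant to replace the `CC` and `MPICC` lines and must come after the definition of `TAU_FLAGS`.

```make
# Hypothetical toggle (not in the original configuration):
# USE_TAU=1 builds with TAU instrumentation, USE_TAU=0 builds without it.
USE_TAU ?= 1

ifeq ($(USE_TAU),1)
# Instrumented build: same wrapper and flags as the original settings.
CC    = tau_cc.sh $(TAU_FLAGS)
MPICC = tau_cc.sh $(TAU_FLAGS)
else
# Plain build: compiler names are assumptions, adjust to the local toolchain.
CC    = gcc
MPICC = mpicc
endif
```

With this in place, `make USE_TAU=0` would select the plain compilers, while a bare `make` keeps the instrumented build shown above.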

PDT was chosen to allow for event filtering. Here is the select file used:

BEGIN_EXCLUDE_LIST

double RngStream_RandU01(RngStream)

double U01(RngStream)

END_EXCLUDE_LIST

Experiment simulation parameters

Three sets of simulation parameters were provided with the source code: A, B, and C (the largest). There is also a choice of m-cell size, 20 or 96; a size of 96 requires significantly more memory. Sets A and B with an m-cell size of 20 were used for these performance results.

Performance Results

Here are some performance results that show the overall execution model:

This shows a trace of a single execution on one MPI process. (Multiple nodes and GPUs can be utilized as well; performance behavior is similar across processes.)

This shows a representative profile for the GPU kernel.

[[Image:gtc_gpu_profile.png|750px]]