Guide:TAUChapel

From Tau Wiki
Revision as of 17:24, 5 October 2013 by Scottb (Performance Results)

Chapel

Monte Carlo example

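To try out some of Chapel's language features, let us write a Monte Carlo simulation that estimates PI. We can estimate PI by counting how many of n random points with coordinates x, y fall inside the unit circle, i.e. x^2 + y^2 <= 1. If the points are drawn uniformly at random from the unit square, the fraction that lands inside approaches PI/4 (the area of the quarter circle), so PI is approximately 4 times that fraction.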
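As a point of reference, the estimator can be sketched as a short sequential program. This is only an illustration under stated assumptions: the names n, seed, hits, and estimate are hypothetical, the points are assumed to be drawn from the unit square, and fillRandom from the standard Random module is assumed to be an acceptable source of uniform values.

```chapel
use Random;

config const n = 1000000;        // hypothetical number of sample points
config const seed = 314159265;   // hypothetical fixed (odd) seed for repeatable runs

var p_x: [1..n] real(64);
var p_y: [1..n] real(64);

// Fill both coordinate arrays with uniform values in [0, 1).
fillRandom(p_x, seed);
fillRandom(p_y, seed + 2);

// Count the points that land inside the quarter circle.
var hits = 0;
for i in 1..n do
  if p_x[i] ** 2 + p_y[i] ** 2 <= 1.0 then
    hits += 1;

// The hit fraction approximates PI/4.
const estimate = 4.0 * hits / n;
writeln("PI is approximately ", estimate);
```

The parallel versions in the following sections compute the same count; only the way the hits are accumulated changes.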

Basic

Here is the basic routine that computes PI:

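proc compute_pi(p_x: [] real(64), p_y: [] real(64)) : real {

 // n (the number of points) is assumed to be declared at module scope.
 var c : sync int;
 c.writeEF(0);                      // start full, holding 0
 forall i in 1..n {
   if (p_x[i] ** 2 + p_y[i] ** 2 <= 1) then
       c.writeEF(c.readFE() + 1);   // empty c, then refill it with the new count
 }
 return c.readFF() * 4.0 / n;

}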

Notice that the forall may execute its iterations in parallel, hence the need to declare c as a sync variable so that concurrent updates do not race. Performance is limited by this synchronized access to c. Take a look at this profile:

Pi with tasks.png

70% of the time is spent in synchronization. Let's see if we can do better.
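To see in isolation why every update to c is a serialization point, the full/empty protocol of a sync variable can be sketched as below. This is a minimal illustration, not the profiled program: tasks, perTask, and total are hypothetical names, and the explicit readFE/writeEF/readFF methods are used to make each blocking step visible.

```chapel
config const tasks = 4;        // hypothetical number of concurrent tasks
config const perTask = 1000;   // hypothetical increments per task

var c: sync int = 0;           // starts full, holding 0

coforall t in 1..tasks {
  for j in 1..perTask {
    const v = c.readFE();      // wait until full, take the value, leave c empty
    c.writeEF(v + 1);          // wait until empty, store the value, leave c full
  }
}

const total = c.readFF();      // wait until full, read without emptying
writeln(total);                // prints 4000 with the defaults above
```

While one task holds the value between readFE and writeEF, every other task blocks in readFE, so the increments execute one at a time no matter how many tasks are running.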

Procedure promotion

One feature of Chapel is procedure promotion: calling a procedure that takes scalar arguments with arrays instead behaves as if the procedure were applied to each element of the arrays, in parallel:

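proc compute_pi(p_x: [] real(64), p_y: [] real(64)) : real {

 var c : sync int;
 c.writeEF(0);                      // c must start full, or the first readFE blocks forever
 // The promoted call yields one bool per (p_x[i], p_y[i]) pair.
 forall i in in_circle(p_x, p_y) {
   c.writeEF(c.readFE() + i:int);   // add 1 for a hit, 0 for a miss
 }
 return c.readFF() * 4.0 / n;

}
proc in_circle(x: real(64), y: real(64)): bool
{
  return (x ** 2 + y ** 2) <= 1;
}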
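The effect of promotion is easiest to see on a tiny input. The sketch below repeats in_circle so that it is self-contained; the arrays xs, ys, and inside are hypothetical names chosen for the illustration.

```chapel
proc in_circle(x: real(64), y: real(64)): bool {
  return (x ** 2 + y ** 2) <= 1;
}

var xs: [1..3] real(64) = [0.0, 0.5, 1.0];
var ys: [1..3] real(64) = [0.0, 0.5, 1.0];

// Promoted call: in_circle is declared on scalars but is passed arrays,
// so it is applied to each pair (xs[i], ys[i]).
var inside: [1..3] bool = in_circle(xs, ys);

writeln(inside);   // prints: true true false
```

The points (0, 0) and (0.5, 0.5) lie inside the circle, while (1, 1) does not, which is why only the last entry is false.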

Reduction

Furthermore, a little reorganization lets us take advantage of Chapel's built-in reduction:

proc compute_pi(p_x: [] real(64), p_y: [] real(64)) : real {

 var c : int;
 // Sum the promoted in_circle results; no sync variable is needed.
 c = + reduce in_circle(p_x, p_y);
 return c * 4.0 / n;

}

This also improves performance:

Pi with data.png
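The same reduction can also be written as an explicit loop with a reduce intent, which may be easier to extend when the loop body does more than one thing. This is an alternative formulation, not the version that was profiled above; the names n, c, and estimate and the two seeds are assumptions made for the sketch.

```chapel
use Random;

config const n = 1000000;      // hypothetical number of sample points

var p_x: [1..n] real(64);
var p_y: [1..n] real(64);
fillRandom(p_x, 314159265);    // hypothetical fixed seeds for repeatable runs
fillRandom(p_y, 271828183);

var c = 0;
// Each task accumulates into its own private copy of c;
// the copies are summed into the outer c when the loop finishes.
forall (x, y) in zip(p_x, p_y) with (+ reduce c) {
  if x ** 2 + y ** 2 <= 1.0 then
    c += 1;
}

const estimate = 4.0 * c / n;
writeln("PI is approximately ", estimate);
```

Because each task only touches its private copy inside the loop, there is no per-iteration synchronization of the kind the sync variable required.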

Multiple Locales

Let's look at how the arrays of x and y values are allocated:

var p_x: [1..n] real(64);
var p_y: [1..n] real(64);

However, Chapel provides a way to distribute these arrays across multiple locales:


use BlockDist;   // provides the Block distribution (recent Chapel releases rename it blockDist)

const space = {1..n};
var Dom: domain(1) dmapped Block(boundingBox=space) = space;

var p_x: [Dom] real(64);
var p_y: [Dom] real(64);

This Block mapping allocates the elements block-wise among the locales. Furthermore, the reduction used earlier continues to work unchanged.
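One way to check where the elements end up is to record the owning locale of each index. The sketch below is an illustration under assumptions: it uses the blockDist spelling found in recent Chapel releases (older releases use the Block spelling shown above), and n, owner, and idSum are hypothetical names.

```chapel
use BlockDist;

config const n = 16;   // hypothetical small size so the map stays readable

const space = {1..n};
// Recent-release spelling of the block-distributed domain declaration.
const Dom: domain(1) dmapped new blockDist(boundingBox=space) = space;

var owner: [Dom] int;

// A forall over a distributed domain runs each iteration on the locale
// that owns that index, so here.id records the owner.
forall i in Dom do
  owner[i] = here.id;

writeln("locales: ", numLocales);
writeln(owner);

// Reductions work on distributed arrays just as on local ones.
const idSum = + reduce owner;
writeln("sum of owner ids: ", idSum);
```

On a single locale every entry is 0. With more locales, contiguous runs of indices should share the same locale id, which is the block-wise layout described above.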

Performance Results

There are a couple of options for collecting Chapel performance data with TAU. To begin, configure TAU with PDT, pthreads, and BFD (for sampling).

Compiling the Chapel program with --savec c_code stores the intermediate C source files in the c_code directory. Compiling that C code with TAU is easy:

make -f c_code/Makefile CC=tau_cc.sh

However, since each source file is included as a header, none of them will be instrumented. These source files can instead be modified to add TAU probes directly. Furthermore, sampling can be added to get more detail (time spent in the pthread library, for example).