Uintah/AMR


Basic Information

{| border="1" cellpadding="3"
|-
! Machine
| Inferno (128-node 2.6 GHz Xeon cluster)
|-
! Input File
| hotBlob_AMRb.ups
|-
! Run size
| 64 CPUs
|-
! Run time
| about 35 minutes (2100 seconds)
|-
! Date
| March 2007
|}

Callpath Results

TAU was configured with:

<pre>
-mpiinc=/usr/local/lam-mpi/include/ -mpilib=/usr/local/lam-mpi/lib -PROFILECALLPATH -useropt=-O3
</pre>
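For context, the sketch below shows the kind of TAU source instrumentation that produces callpath timers when TAU is built with -PROFILECALLPATH. It is a minimal, made-up C++/MPI example, not Uintah code: the routine name is only a stand-in for an ICE task, and the real Uintah/TAU integration may instrument tasks differently. MPI routines such as MPI_Waitsome() and MPI_Allreduce() are timed automatically by TAU's MPI wrapper library (hence the -mpiinc/-mpilib options above).

<pre>
// Illustrative only -- not Uintah source.  Assumes a TAU-instrumented build
// (e.g. compiled with tau_cxx.sh or the TAU Makefile stubs).
#include <mpi.h>
#include <TAU.h>

void advectAndAdvanceInTime()        // hypothetical stand-in for an ICE task
{
  TAU_PROFILE("ICE::advectAndAdvanceInTime", "", TAU_DEFAULT);
  // ... task computation; time spent here appears under this timer in the
  // callpath profile, nested beneath whatever routine called it ...
}

int main(int argc, char** argv)
{
  TAU_PROFILE("main()", "int (int, char **)", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);

  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  TAU_PROFILE_SET_NODE(rank);        // one profile per MPI rank

  advectAndAdvanceInTime();          // shows up in the callpath view as
                                     // main() => ICE::advectAndAdvanceInTime

  MPI_Finalize();
  return 0;
}
</pre>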

Highlighted below in the mean callpath profile view are several key numbers. First is the overall runtime of 2112.575 seconds. Next, we see that 165.938 seconds are spent in the task <tt>ICE::advectAndAdvanceInTime</tt>.

Of those 165.938 seconds, 85.071 are spent in computation (<tt>runTask()</tt>) and 53.622 seconds are spent in <tt>MPI_Waitsome()</tt>.
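To make it concrete why time accumulates in MPI_Waitsome() rather than in the task itself, the sketch below shows the general pattern; it is not the Uintah scheduler's actual code, and the function and variable names are invented for illustration.

<pre>
// Illustrative sketch only -- not the Uintah scheduler.  The rank has posted
// non-blocking receives for data produced on other ranks and blocks in
// MPI_Waitsome until at least one of them completes.
#include <mpi.h>
#include <vector>

void waitForDependencies(std::vector<MPI_Request>& requests)
{
  std::vector<int> indices(requests.size());
  int outcount = 0;

  // Any time spent here is waiting on other ranks, not computing, so a large
  // MPI_Waitsome total in the profile points at communication cost or at
  // ranks that finished their own work early and are idling.
  MPI_Waitsome(static_cast<int>(requests.size()), requests.data(),
               &outcount, indices.data(), MPI_STATUSES_IGNORE);

  // ... hand the completed messages (requests[indices[0..outcount-1]])
  //     to the tasks that were waiting on them ...
}
</pre>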

Also notable, 734.722 seconds (roughly one third of the execution) are spent in the <tt>SchedulerCommon::compile()</tt> step of the task-graph creation/compilation phase. Of that time, 312.48 seconds are spent in <tt>MPI_Allreduce()</tt>. The nodes perform a simple checksum to verify that they all have the same graph. My guess is that the <tt>MPI_Allreduce()</tt> is simply acting as a synchronization point at the start of an iteration, and that the large amount of time here is due to an imbalance in the workload (some nodes reach it much earlier than others). The time in this <tt>MPI_Allreduce()</tt>, though not shown here, ranges from 185 to 529 seconds across ranks, with a mean of 312 seconds and a standard deviation of 82 seconds.
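As a purely illustrative sketch of the pattern being described, the fragment below checks that every rank computed the same task-graph checksum by reducing the minimum and maximum values across ranks. Because MPI_Allreduce() cannot complete on any rank until every rank has entered it, the rank-to-rank spread in its measured time (185 to 529 seconds here) largely measures how long the faster ranks sit waiting for the slowest one. The checksum helper and names are assumptions, not Uintah's actual code.

<pre>
// Illustrative sketch only -- not Uintah's SchedulerCommon::compile().
#include <mpi.h>
#include <cstdio>

// Hypothetical stand-in for Uintah's graph checksum: in the real code this
// would hash the locally built task graph.
static long computeTaskGraphChecksum()
{
  return 42;   // placeholder value so the sketch compiles and links
}

void verifyGraphsMatch(MPI_Comm comm)
{
  long local  = computeTaskGraphChecksum();
  long minSum = 0, maxSum = 0;

  // No rank can leave these calls until every rank has arrived, so the time
  // charged to MPI_Allreduce on a fast rank is mostly waiting for slow ranks.
  MPI_Allreduce(&local, &minSum, 1, MPI_LONG, MPI_MIN, comm);
  MPI_Allreduce(&local, &maxSum, 1, MPI_LONG, MPI_MAX, comm);

  if (minSum != maxSum) {
    std::fprintf(stderr, "Task graphs differ across ranks!\n");
    MPI_Abort(comm, 1);
  }
}
</pre>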


[[Image:UintahAMR-callpath.png]]