Table of content:

 

1. Euler scalar example

·       Example: Euler scalar program

·       Compiling the Euler scalar example

·       Example: Euler scalar program makefile

·       Using Oprofile to collect sampling data

·       Analyzing sampling data

·       Example: Euler program sampling data

·       Example: Euler program sampling data at the instruction-level

 

2. Euler example that uses Array of Structure (AOS) to apply SIMD directives

·       Example: Euler code that uses AOS to apply SIMD directives

·       Compiling the AOS code to apply the SIMD directives

·       Example: AOS makefile

·       Analyzing sampling data

·       Example: AOS sampling data

·       Improving performance with higher optimization level of the compiler

o      Example: AOS makefile using gcc -03

o      Example: AOS gcc -03 sampling data

o      Example: AOS makefile using xlc -05

o      Example: AOS xlc -05 sampling data

 

3. Euler example that uses Structure of Arrays (SOA) to apply SIMD directives

·       Example: Euler code that uses SOA to apply SIMD directives

·       Compiling the SOA code to apply the SIMD directives

·       Example: SOA makefile

·       Analyzing sampling data

·       Example: SOA sampling data

 

4. Euler example that uses single SPE unit

·       Example: PPE code for the single SPE unit

·       Example: SPE code for the single SPE unit

·       Compiling the single SPE Euler example

o      Example: SPE makefile using gcc

o      Example: PPE makefile using gcc

·       Analyzing sampling data

·       Example: SPE sampling data

·       Improving the efficiency of SPU macros

·       Example: SPE code for the single SPE unit with more efficient SPE macros

·       Example: SPE sampling data with more efficient SPE macros

·       Compiling the single SPE Euler that uses the more efficient SPE macros

o      Example: SPE makefile using xlc -05

o      Example: PPE makefile using xlc -05

o      Example: xlc -05 sampling data

·       Collecting frequency profile using FDPR-Pro

·       Example: Frequency profiling data

·       Adding branch hint directives

 

 

5. Euler example that uses multiple SPEs

·       Example: PPE code for multiple SPEs

·       Compiling the multiple SPEs Euler example

·       Example: PPE makefile for multiple SPEs

 

6. Tuning the Euler example that uses multiple SPEs

·       Compiling the multiple SPEs Euler example to collect trace information

·       Example: SPE makefile to collect trace information

·       Collecting trace information for the multiple SPEs

·       Viewing trace data for the PPE and each SPE

 

7. The tuned Euler example using multiple SPE units

·       Example: SPE code for the tuned multiple SPEs

·       Compiling the multiple SPEs Euler example

·       Viewing trace data for the tuned multiple SPE

·       Example: Further tuned SPE code for the process_buffer function

·       Viewing trace data of the final tuned Euler example

 

 

Programming Examples

 

 

Euler Programming Examples

 

In this programming example we demonstrate the usage of the visualized performance analysis together with basic programming optimizations and compiler optimization options to provide a simple methodology for porting and tuning Cell programs.

We start with a simple scalar program example which can also be found in the IBM Cell programming tutorial [13] and show all the recommended tuning steps using visualized performance analysis tools, until reaching a fully optimized, tuned and scaled Cell program version.

The paper is organized as follows: section 1 describes a simple scalar example of the Euler-based particle-system simulation. Sections 2 and 3 describe two known approaches to SIMDizing the program. Section 4 shows how to enable the example to run on a single SPE, followed by a version that uses multiple SPEs in section 5. Finally, section 6 describes how to tune and scale the multiple SPE Cell program to run optimally and without delays.


1. Euler Scalar example

Throughout this dcoument we will use a simple Euler-based particle-system simulation.

The example uses numerical integration techniques using Euler's method of integration by computing the next value of a function of time F(t), using the Euler step formula:

F(t+dt) = F(t) + F'(t) * dt;

 

The particle system in this example consists of:

·       Array of 3-D positions for each particle (pos[]) expressed by 4-D homogenous coordinates (x,y,z,1)

·       Array of 3-D velocities for each particle (vel[]) expressed as 4-D homogenous coordinates (x,y,z,1)

·       Array of masses for each particle (mass[]) expressed as 4-D homogenous coordinates (x,y,z,1)

·       Force vector that changes over time (force)

 

The simple scalar code version of the program is given here along with compilation instructions:

 

Euler scalar program:

#define END_OF_TIME     10

#define PARTICLES       100000

 

/* 4-D vector definition.

 */

typedef struct {

  float x, y, z, w;

} vec4D;

 

vec4D pos[PARTICLES];           // particle positions

vec4D vel[PARTICLES];           // particle velocities

vec4D force;                    // current force being applied to the particles

float inv_mass[PARTICLES];      // inverse mass of the particles

float dt = 1.0f;                // step in time

 

int main()

{

  int i;

  float time;

  float dt_inv_mass;

 

  // For each step in time

  for (time=0; time<END_OF_TIME; time += dt) {

    // For each particle

    for (i=0; i<PARTICLES; i++) {

      // Compute the new position and velocity as acted upon by the force f.

      pos[i].x = vel[i].x * dt + pos[i].x;

      pos[i].y = vel[i].y * dt + pos[i].y;

      pos[i].z = vel[i].z * dt + pos[i].z;

 

      dt_inv_mass = dt * inv_mass[i];

 

      vel[i].x = dt_inv_mass * force.x + vel[i].x;

      vel[i].y = dt_inv_mass * force.y + vel[i].y;

      vel[i].z = dt_inv_mass * force.z + vel[i].z;

    }

  }

  return (0);

}

 

Compiling the Euler scalar example:

Providing that the above code is placed in a file called "euler_scalar.c", then to compile it, simply use the following command, which includes the options of preserving relocation and line number information in the generated binary "euler_scalar":

 

$ cc –g –Wl,-q –o euler_scalar euler_scalar.c

 

Or, preferably, use the following Makefile:

 

Example: Euler scalar program makefile:

########################################################################

#                       Target

########################################################################

PROGRAM_ppu            := euler_scalar

 

 

########################################################################

#                       Local Defines

########################################################################

INCLUDE                := -I ..

 

CFLAGS          = -g

LDFLAGS         = -g -Wl,-q

 

########################################################################

#                       buildutils/make.footer

########################################################################

ifdef CELL_TOP

        include $(CELL_TOP)/buildutils/make.footer

else

        include ../../../../buildutils/make.footer

    endif

When compiling the above program to include line number information we can already use the Oprofile and VPA Cell performance tools to visualize the program, even before being SIMdized or parallelized for the Cell BE. This information is going to be useful in the next steps in order to evaluate the overhead which is accompanied by the SIMDization and the parallelization process.

 

Using Oprofile to collect sampling data:

As root, type the following instructions to collect profiling using Oprofile

# opcontrol --deinit    // recommended to clear any previous Oprofile data and bringing the daemon down

# opcontrol --init

# opcontrol --reset

Signalling daemon... done

// to collect PPU average cycle events:

# opcontrol --separate=all--event=CYCLES:100000    // for SPU cycles, use: --event=SPU_CYCLES:100000

# opcontrol --start

Profiler running.

# euler_scalar    // Running the Euler example program

# opcontrol --stop

Stopping profiling.

# opcontrol --dump

 # opreport -X -g -l -d -o euler.opm   // To generate an XML output report called: euler.opm

 

Analyzing sampling data:

After gathering profiling, using the Oprofile tool, about the distribution of the cycles that are spent in the body of the inner loop, we can display them with the Profile Analyzer plugin of the Visual Performance Analyzer (VPA) tool, as follows:

 

Example: Euler program sampling data:

 

Example: Euler program sampling data at the instruction level:

It may be interesting to see the same cycles distribution at the instruction level:

 

 

Surprisingly the largest number of spent cycles is on the "fadds" instruction that is generated from the following C statement in the inner loop:  

pos[i].z = vel[i].z * dt + pos[i].z;

 

 

2. Euler example that uses Array of Structure (AOS) to apply SIMD directives

 

The most common technique for SIMDizing code when input is naturally expressed as a vector is called Array Of Structures (AOS).

 

The following example shows how to SIMDize in the AOS form. Vector/SIMD Multimedia Extension intrinsics are used, and they can be identified by their prefix vec_. The algorithm assumes that the number of particles is a multiple of four. Special code must be included to handle the last number of particles that is not a multiple of four.

 

Example: Euler code that uses AOS to apply SIMD directives:

#define END_OF_TIME 10

#define PARTICLES   100000

 

/* 4-D vector definition.

 */

typedef struct {

  float x, y, z, w;

} vec4D;

 

vec4D pos[PARTICLES] __attribute__ ((aligned (16)));

vec4D vel[PARTICLES] __attribute__ ((aligned (16)));

vec4D force __attribute__ ((aligned (16)));

 

float inv_mass[PARTICLES] __attribute__ ((aligned (16)));

float dt __attribute__ ((aligned (16))) = 1.0f;

 

int main() {

  int i;

  float time;

  float dt_inv_mass __attribute__ ((aligned (16)));

  vector float dt_v, dt_inv_mass_v;

  vector float *pos_v, *vel_v, force_v;

  vector float zero = (vector float){0.0f, 0.0f, 0.0f, 0.0f};

 

  pos_v = (vector float *)pos;

  vel_v = (vector float *)vel;

 

  force_v = *((vector float *)&force);

 

  // Replicate the variable time step across elements 0-2 of

  // a floating point vector. Force the last element (3) to zero.

  dt_v = vec_sld(vec_splat(vec_lde(0, &dt), 0), zero, 4);

 

  // For each step in time

  for (time=0; time<END_OF_TIME; time += dt) {

 

    // For each particle

    for (i=0; i<PARTICLES; i++) {

 

      // Compute the new position and velocity as acted upon by the force f.

      pos_v[i] = vec_madd(vel_v[i], dt_v, pos_v[i]);

 

      dt_inv_mass = dt * inv_mass[i];

      dt_inv_mass_v = vec_splat(vec_lde(0, &dt_inv_mass), 0);

 

      vel_v[i] = vec_madd(dt_inv_mass_v, force_v, vel_v[i]);

 

     }

   }

   return (0);

 }

 

Compiling the AOS code to apply the SIMD directives:

 

Example: AOS makefile

Provided that the file name containing the above code is "euler_simd_aos.c", then to compile it, use the following recommended Makefile:

 

########################################################################

#                       Target

########################################################################

PROGRAM_ppu            := euler_simd_aos

 

 

########################################################################

#                       Local Defines

########################################################################

INCLUDE                := -I ..

 

CFLAGS          = -g

LDFLAGS         = -g -Wl,-q

 

########################################################################

#                       buildutils/make.footer

########################################################################

ifdef CELL_TOP

        include $(CELL_TOP)/buildutils/make.footer

else

        include ../../../../buildutils/make.footer

    endif

 

 

Analyzing sampling data for AOS example:

 

Here is the collected Oprofile for the AOS SIMDized version:

 

Although the total of spent cycles stands on 909, the collected Oprofile shows that performance bottleneck is now on the statement

pos_v[i] = vec_madd(vel_v[i], dt_v, pos_v[i]); (59.6%)

 

Improving performance with higher optimization level of the compiler:

One basic way to improve the performance is to use higher optimization levels of the compiler.

Here is the modified Makefile using gcc –O3 level:

 

Example: AOS makefile  using gcc –O3:

########################################################################

#                       Target

########################################################################

PROGRAM_ppu            := euler_simd_aos

 

 

########################################################################

#                       Local Defines

########################################################################

INCLUDE                := -I ..

 

CC_OPT_LEVEL    := -O3

 

CFLAGS          = -g

LDFLAGS         = -g -Wl,-q

 

########################################################################

#                       buildutils/make.footer

########################################################################

ifdef CELL_TOP

        include $(CELL_TOP)/buildutils/make.footer

else

        include ../../../../buildutils/make.footer

   endif

 

Example: AOS gcc -03 sampling data:

After applying optimization option of gcc –O3 we can see an improvement in the total of cycles to 721 as shown below:

 

 

 

Example: AOS makefile using xlc –O5:

A higher optimization level of –O5 can be applied using the IBM xlc compiler.

Here is the modified Makefile using xlc –O5 level:

 

########################################################################

#                       Target

########################################################################

PROGRAM_ppu            := euler_simd_aos

 

########################################################################

#                       Local Defines

########################################################################

PPU_COMPILER    = xlc

 

INCLUDE         := -I ..

 

CC_OPT_LEVEL    := -O5

 

CFLAGS          = -g

LDFLAGS         = -g -Wl,-q

 

########################################################################

#                       buildutils/make.footer

########################################################################

ifdef CELL_TOP

        include $(CELL_TOP)/buildutils/make.footer

else

        include ../../../../buildutils/make.footer

   endif

 

Example: AOS xlc -05 sampling data:

An even lower number of 531 cycles with no isolated bottlenecks can be obtained using the xlC –O5 compiler option.

Here is a screen shot of the generated profiled code:

 


3. Euler example that uses Structure of Arrays (SOA) to apply SIMD directives

 

Another form of SIMDization is called Structure-of-Arrays (SOA). In this form, you collect the individual elements of the natural vectors into separate arrays (also called parallel-array form). The code is then written as if it were to execute scalar instructions, but it will be executing SIMD instructions. This results in code that computes four single-precision floats results simultaneously.

 

The following example shows how to SIMDize the Euler code example in the SOA form. As in Step 1a, the algorithm assumes that the number of particles is a multiple of 4.

 

Example: Euler code that uses SOA to apply SIMD directives:

#define END_OF_TIME 10

#define PARTICLES 100000

 

typedef struct {

  float x, y, z, w;

} vec4D;  

 

// Separate arrays for each component of the vector.

vector float pos_x[PARTICLES/4], pos_y[PARTICLES/4], pos_z[PARTICLES/4];

vector float vel_x[PARTICLES/4], vel_y[PARTICLES/4], vel_z[PARTICLES/4];

 

vec4D force __attribute__ ((aligned (16)));

 

float inv_mass[PARTICLES] __attribute__ ((aligned (16)));

 

float dt = 1.0f;

 

int main() {

 

  int i;

  float time;

  float dt_inv_mass __attribute__ ((aligned (16)));

  vector float force_v, force_x, force_y, force_z;

  vector float dt_v, dt_inv_mass_v;

 

  // Create a replicated vector for each component of the force vector.

  force_v = *(vector float *)(&force);

  force_x = vec_splat(force_v, 0);

  force_y = vec_splat(force_v, 1);

  force_z = vec_splat(force_v, 2);

 

  // Replicate the variable time step across all elements.

  dt_v = vec_splat(vec_lde(0, &dt), 0);

 

  // For each step in time

  for (time=0; time<END_OF_TIME; time += dt) {

 

    // For each particle

    for (i=0; i<PARTICLES/4; i++) {

 

      // Compute the new position and velocity as acted upon by the force f.

      pos_x[i] = vec_madd(vel_x[i], dt_v, pos_x[i]);

      pos_y[i] = vec_madd(vel_y[i], dt_v, pos_y[i]);

      pos_z[i] = vec_madd(vel_z[i], dt_v, pos_z[i]);

 

      dt_inv_mass = dt * inv_mass[i];

      dt_inv_mass_v = vec_splat(vec_lde(0, &dt_inv_mass), 0);

 

      vel_x[i] = vec_madd(dt_inv_mass_v, force_x, vel_x[i]);

      vel_y[i] = vec_madd(dt_inv_mass_v, force_y, vel_y[i]);

      vel_z[i] = vec_madd(dt_inv_mass_v, force_z, vel_z[i]);

    }

  }

  return (0);

}

 

Compiling the SOA code to apply the SIMD directives:

 

Example: SOA makefile:

To compile the above SOA SIMDized version of the Euler example, use the same Makefile that was used for the AOS case, and simply set the PROGRAM_ppu variable to the name of the C file containing the above code, as follows:

 

 


########################################################################

#                       Target

########################################################################

PROGRAM_ppu            := euler_simd_soa

 

 

########################################################################

#                       Local Defines

########################################################################

INCLUDE                := -I ..

 

CFLAGS          = -g

LDFLAGS         = -g -Wl,-q

 

########################################################################

#                       buildutils/make.footer

########################################################################

ifdef CELL_TOP

        include $(CELL_TOP)/buildutils/make.footer

else

        include ../../../../buildutils/make.footer

    endif

 

 

 

Analyzing sampling data:

 

Example: SOA sampling data:

Interestingly, the following screen shot of the Oprofile collected for the SIMdized SOA case shows similar bottlenecks on the statement of:

pos_x[i] = vec_madd(vel_x[i], dt_v, pos_x[i]);  (59.37%)


 

Furthermore, the same tuning process using compiler optimizations of gcc –O3 and xlC –O5 provides similar results to the SOA SIMDization case.

 

 


4. Euler example that uses single SPE unit

 

A simple code example of the Euler program running on the Cell and using a single SPE processor is to simply move the entire loops computation to the SPE part.
The generated code for this simple case would look as follows:

 

 

Example: PPE code for the single SPE unit:

#define END_OF_TIME     10

#define PARTICLES       100000

 

typedef struct {      /* 4-D vector definition. */

  float x, y, z, w;

} vec4D;

 

typedef struct {

  int particles;       // number of particle to process

  vector float *pos_v; // pointer to array of positions vectors

  vector float *vel_v; // pointer to array of velocity vectors

  float *inv_mass;     // pointer to array of mass vectors

  vector float force_v;       // force vector

  float dt;            // current step in time

} parm_context;

 

vec4D pos[PARTICLES] __attribute__ ((aligned (16)));

vec4D vel[PARTICLES] __attribute__ ((aligned (16)));

vec4D force __attribute__ ((aligned (16)));

float inv_mass[PARTICLES] __attribute__ ((aligned (16)));

float dt = 1.0f;

 

extern spe_program_handle_t particle;

 

typedef struct ppu_pthread_data {

  spe_context_ptr_t spe_ctx;

  pthread_t pthread;

  void *argp;

} ppu_pthread_data_t;

 

 

void *ppu_pthread_function(void *arg) {

  ppu_pthread_data_t *datap = (ppu_pthread_data_t *)arg;

  unsigned int entry = SPE_DEFAULT_ENTRY;

  if (spe_context_run(datap->spe_ctx, &entry, 0, datap->argp, NULL, NULL) < 0) {

    perror ("Failed running context");

    exit (1);

  }

  pthread_exit(NULL);

}

 

int main()

{

  ppu_pthread_data_t data;

  parm_context ctx __attribute__ ((aligned (16)));

 

  ctx.particles = PARTICLES;

  ctx.pos_v = (vector float *)pos;

  ctx.vel_v = (vector float *)vel;

  ctx.force_v = *((vector float *)&force);

  ctx.inv_mass = inv_mass;

  ctx.dt = dt;

 

  /* Create a SPE context */

  if ((data.spe_ctx = spe_context_create (0, NULL)) == NULL) {

    perror ("Failed creating context");

    exit (1);

  }

  /* Load SPE program into the SPE context*/

  if (spe_program_load (data.spe_ctx, &particle))  {

    perror ("Failed loading program");

    exit (1);

  }

  /* Initialize context run data */

  data.argp = &ctx;

  /* Create pthread for each of the SPE contexts */

  if (pthread_create (&data.pthread, NULL, &ppu_pthread_function, &data)) {

    perror ("Failed creating thread");

    exit (1);

  }

  /* Wait for the threads to complete */

  if (pthread_join (data.pthread, NULL)) {

    perror ("Failed joining thread\n");

    exit (1);

  }

  /* Destroy SPE context */

  if (spe_context_destroy (data.spe_ctx) != 0) {

    perror("Failed destroying context");

    exit (1);

  }

 

  return (0);

}

 

The following SPE code runs the loops body on the SPE processor.

 

Example: SPE code for the single SPE unit:

#define END_OF_TIME     10

#define PARTICLES       100000

 

typedef struct {    /* 4-D vector definition. */

  float x, y, z, w;

} vec4D;

 

typedef struct {

  int particles;       // number of particle to process

  vector float *pos_v; // pointer to array of positions vectors

  vector float *vel_v; // pointer to array of velocity vectors

  float *inv_mass;     // pointer to array of mass vectors

  vector float force_v;       // force vector

  float dt;                  // current step in time

} parm_context;

 

#define PARTICLES_PER_BLOCK   1024 /* # of particles to process at a time */

 

// Local store structures and buffers.

volatile parm_context ctx;

volatile vector float pos[PARTICLES_PER_BLOCK];

volatile vector float vel[PARTICLES_PER_BLOCK];

volatile float inv_mass[PARTICLES_PER_BLOCK];

 

int main(unsigned long long spu_id __attribute__ ((unused)), unsigned long long parm)

{

  int i, j;

  int left, cnt;

  float time;

  unsigned int tag_id;

  vector float dt_v, dt_inv_mass_v;

 

  // Reserve a tag ID

  tag_id = mfc_tag_reserve();

 

  spu_writech(MFC_WrTagMask, -1);

 

  // Input parameter parm is a pointer to the particle parameter context.

  // Fetch the context, waiting for it to complete.

 

  spu_mfcdma32((void *)(&ctx), (unsigned int)parm, sizeof(parm_context), tag_id, MFC_GET_CMD);

  (void)spu_mfcstat(MFC_TAG_UPDATE_ALL);

 

  dt_v = spu_splats(ctx.dt);

 

  // For each step in time

  for (time=0; time<END_OF_TIME; time += ctx.dt) {

    // For each block of particles

    for (i=0; i<ctx.particles; i+=PARTICLES_PER_BLOCK) {

      // Determine the number of particles in this block.

      left = ctx.particles - i;

      cnt = (left < PARTICLES_PER_BLOCK) ? left : PARTICLES_PER_BLOCK;

 

      // Fetch the data - position, velocity and inverse_mass. Wait for the DMA to complete

      // before performing computation.

      spu_mfcdma32((void *)(pos), (unsigned int)(ctx.pos_v+i),

                    cnt * sizeof(vector float), tag_id, MFC_GETB_CMD);

      spu_mfcdma32((void *)(vel), (unsigned int)(ctx.vel_v+i),

                    cnt * sizeof(vector float), tag_id, MFC_GET_CMD);

      spu_mfcdma32((void *)(inv_mass), (unsigned int)(ctx.inv_mass+i),

                    cnt * sizeof(float), tag_id, MFC_GET_CMD);

      (void)spu_mfcstat(MFC_TAG_UPDATE_ALL);

 

      // Compute the step in time for the block of particles

      for (j=0; j<cnt; j++) {

        pos[j] = spu_madd(vel[j], dt_v, pos[j]);

        dt_inv_mass_v = spu_mul(dt_v, spu_splats(inv_mass[j]));

        vel[j] = spu_madd(dt_inv_mass_v, ctx.force_v, vel[j]);

      }

 

      // Put the position and velocity data back into system memory

      spu_mfcdma32 ((void *)(pos), (unsigned int)(ctx.pos_v+i),

                    cnt * sizeof(vector float), tag_id, MFC_PUT_CMD);

      spu_mfcdma32((void *)(vel), (unsigned int)(ctx.vel_v+i),

                    cnt * sizeof(vector float), tag_id, MFC_PUT_CMD);

    }

  }

  // Wait for final DMAs to complete before terminating SPU thread.

  (void)spu_mfcstat(MFC_TAG_UPDATE_ALL);

  return (0);

}

 

Compiling the single SPE euler example:

 

Example: SPE makefile using gcc:

 

In order to compile the above code, place the SPE code part file "particle.c" in a subdirectory called "spu" under the directory where the PPE code file "euler_spe.c" is placed.

Here is the Makefile for the SPE code:

 

########################################################################

#                      Target

########################################################################

PROGRAM_spu    := particle

LIBRARY_embed  := lib_particle_spu.a

 

 

########################################################################

#                      Local Defines

########################################################################

INCLUDE        := -I ..

 

CFLAGS          = -g

LDFLAGS         = -g -Wl,-q

 

########################################################################

#                      buildutils/make.footer

########################################################################

ifdef CELL_TOP

        include $(CELL_TOP)/buildutils/make.footer

else

        include ../../../../../buildutils/make.footer

   endif

 

Example: PPE makefile using gcc:

Here is the Makefile for the PPE code part in the file "euler_spe.c", located one directory above the "spu" directory:

 

########################################################################

#                       Subdirectories

########################################################################

DIRS           := spu

 

########################################################################

#                       Target

########################################################################

PROGRAM_ppu    := euler_spe

 

########################################################################

#                       Local Defines

########################################################################

 

IMPORTS         = spu/lib_particle_spu.a -lspe2 -lpthread

 

CFLAGS          = -g

LDFLAGS         = -g -Wl,-q

 

########################################################################

#                       buildutils/make.footer

########################################################################

 

ifdef CELL_TOP

        include $(CELL_TOP)/buildutils/make.footer

else

        include ../../../../buildutils/make.footer

endif

 

Analyzing sampling data:

 

In general, when moving the hot loops to run in the SPE rather than on the PPE, the number of cycles is expected to increase with compare to the same loop when run on the PPE.

 

Example: SPE sampling data:

Here is what the Oprofile collects for the SPE code of the above Cell version of the Euler program (using the SPU_CYCLES event of Oprofile):

 

 

As can be seen, the total number of cycles increases from 909 to 7680.

It is, therefore, highly important to make sure that the SPU macros that are used, will be efficient.

 

Improving the effiecency of SPU macros:

The following code replaces the non-efficient spu_mfcdma32 macro by the more recent and more efficient API macros: mfc_get, mfc_getb, mfc_put, mfc_write_tag_update_all:

 

Example: SPE code for the single SPE unit with more efficient SPE macros:

#define END_OF_TIME     10

#define PARTICLES       100000

 

typedef struct {    /* 4-D vector definition. */

  float x, y, z, w;

} vec4D;

 

typedef struct {

  int particles;       // number of particle to process

  vector float *pos_v; // pointer to array of positions vectors

  vector float *vel_v; // pointer to array of velocity vectors

  float *inv_mass;            // pointer to array of mass vectors

  vector float force_v;   // force vector

  float dt;                   // current step in time

} parm_context;

 

#define PARTICLES_PER_BLOCK   1024 /* # of particles to process at a time */

 

// Local store structures and buffers.

volatile parm_context ctx;

volatile vector float pos[PARTICLES_PER_BLOCK];

volatile vector float vel[PARTICLES_PER_BLOCK];

volatile float inv_mass[PARTICLES_PER_BLOCK];

 

int main(unsigned long long spu_id __attribute__ ((unused)), unsigned long long parm)

{

  int i, j;

  int left, cnt;

  float time;

  unsigned int tag_id;

  vector float dt_v, dt_inv_mass_v;

 

  // Reserve a tag ID

  tag_id = mfc_tag_reserve();

 

  mfc_write_tag_mask(-1);

 

 

  // Input parameter parm is a pointer to the particle parameter context.

  // Fetch the context, waiting for it to complete.

 

  mfc_get((void *)(&ctx), (unsigned int)parm, sizeof(parm_context), tag_id, 0, 0);

  // mfc_write_tag_update_all();

  mfc_read_tag_status_all();

 

  dt_v = spu_splats(ctx.dt);

 

  // For each step in time

  for (time=0; time<END_OF_TIME; time += ctx.dt) {

    // For each block of particles

    for (i=0; i<ctx.particles; i+=PARTICLES_PER_BLOCK) {

      // Determine the number of particles in this block.

      left = ctx.particles - i;

      cnt = (left < PARTICLES_PER_BLOCK) ? left : PARTICLES_PER_BLOCK;

 

      // Fetch the data - position, velocity and inverse_mass. Wait for the DMA to complete

      // before performing computation.

      mfc_getb((void *)(pos), (unsigned int)(ctx.pos_v+i), cnt * sizeof(vector float),tag_id,0,0);

      mfc_get((void *)(vel), (unsigned int)(ctx.vel_v+i), cnt * sizeof(vector float), tag_id, 0,0);

      mfc_get((void *)(inv_mass),(unsigned int)(ctx.inv_mass+i), cnt * sizeof(float),tag_id,0,0);

      mfc_write_tag_update_all();

      mfc_read_tag_status_all();

 

      // Compute the step in time for the block of particles

      for (j=0; j<cnt; j++) {

               pos[j] = spu_madd(vel[j], dt_v, pos[j]);

               dt_inv_mass_v = spu_mul(dt_v, spu_splats(inv_mass[j]));

               vel[j] = spu_madd(dt_inv_mass_v, ctx.force_v, vel[j]);

      }

 

      // Put the position and velocity data back into system memory

      mfc_put((void *)(pos), (unsigned int)(ctx.pos_v+i), cnt * sizeof(vector float), tag_id, 0,0);

      mfc_put((void *)(vel), (unsigned int)(ctx.vel_v+i), cnt * sizeof(vector float), tag_id, 0,0);

    }

  }

  // Wait for final DMAs to complete before terminating SPU thread.

  mfc_read_tag_status_all();

  return (0);

}

 

Example: SPE sampling data with more efficient SPE macros:

 

As can be seen, by the following collected profiling, the total number of cycles is significantly reduced to 1332:

 

 

Compiling the single SPE euler that uses more efficient SPE macros:

 

As before, we can improve the performance by increasing the optimization level of the compiler.

Here are the recommended Makefiles for the SPE Euler example, when using xlc –O5:

 

Example: SPE makefile using xlc -05:

########################################################################

#                      Target

########################################################################

PROGRAM_spu    := particle

LIBRARY_embed  := lib_particle_spu.a

 

 

########################################################################

#                      Local Defines

########################################################################

SPU_COMPILER    := xlc

 

INCLUDE         := -I ..

 

CC_OPT_LEVEL    := -O5

 

CFLAGS          = -g

LDFLAGS         = -g -Wl,-q

 

########################################################################

#                      buildutils/make.footer

########################################################################

ifdef CELL_TOP

        include $(CELL_TOP)/buildutils/make.footer

else

        include ../../../../../buildutils/make.footer

   endif

 

Example: PPE makefile using xlc -05:

########################################################################

#                       Subdirectories

########################################################################

DIRS           := spu

 

########################################################################

#                       Target

########################################################################

PROGRAM_ppu    := euler_spe

 

########################################################################

#                       Local Defines

########################################################################

PPU_COMPILER    = xlc

 

IMPORTS         = spu/lib_particle_spu.a -lspe2 -lpthread

 

CC_OPT_LEVEL    := -O5

 

CFLAGS          = -g

LDFLAGS         = -g -Wl,-q

 

########################################################################

#                       buildutils/make.footer

########################################################################

ifdef CELL_TOP

        include $(CELL_TOP)/buildutils/make.footer

else

        include ../../../../buildutils/make.footer

endif

 

Example: xlc -05 sampling data:

Applying xlC –O5 reduces it manages to slightly reduce the total number cycles to 1324 cycles with no outstanding bottlenecks at the instruction level:

 

 

Collecting frequency profile using FDPR-Pro:

 

Another important tool that can help with the porting and the tuning process is the FDPR-Pro which is able to collect accurate frequency profile by instrumenting the binary file and then displaying the collected frequency along with appropriate performance comments, on the Code Analyzer plugin of the VPA tool.

In order to instrument the Euler single SPE example and collect thefreequency profile, use the FDPR-Pro tool directly on the euler_spe executable.

The FDPR-Pro RPM file for Cell consists of 3 main files:

 - /usr/bin/fdprpro

- /usr/lib/libfdprinst32.so

- /usr/lib64/libfdprinst64.so

Important: remove all previously generated SPU profile files (suffixed by .mprof):

$ rm *.mprof

 Recommended: create a temporary directory for FDPR-Pro tmp files generated during its analysis phase:

$ mkdir ./tmpdir

 Run the following command to instrument the euler_spe exec by FDPR-Pro and to create an instrumented file called fft.instr along with an empty profile file euler.nprof:

$ fdprpro euler_spe -cell  -spedir tmpdir  -a instr

This will generate a new file named euler_spe.instr

Running this will automatically generate two files containing frequency profiling information:

-        euler_spe.nprof  // profiling frequency information for the PPU code part

-        particle.mprof  // profiling frequency information for each of the SPU code part

The binary executable file euler_spe can now be loaded and analyzed by Code Analyzer plugin in VPA, along with the generated profile files.

Example: Frequency profiling data:

Here is what Code Analyzer displays for the collected frequency profiling of the Euler SPE program:

 

Adding branch hint directives:

From the above view, here is an important performance comment which Code Analyzer displays about a missing hint instruction from the SPE code:

 

Branch hint bits in the Cell SPE are the equivalent of the branch prediction bits that are used on the PPE. However, with the lack of profiling information, the compiler cannot always determine if a hint bit is needed.

Hint bits on the SPE processor usually have more impact on the performance with compare to the PPE processor, as there is no hardware in the SPE to support branch prediction.

 

To explicitly specify a branch hint in the Cell SPE, one can use the following intrinsic:

__builtin_expect( expression, expected_value )

which generates branching code based off of the first argument, buffer < buffer_end, but produces branch hints based off of the second argument, 1.

This intrinsic instructs the compiler to generate hints that expect the value of buffer < buffer_end to be 1.

 

For example:

 

while(__builtin_expect(buffer < buffer_end, 1)) {

     *buffer = operation_on_buffer(*buffer);

     buffer++;

}

 

The modified SPE code for the Euler example can, therefore, include the following hints statements for all its loops:

 

  for (time=0; __builtin_expect(time<END_OF_TIME,1); time += ctx.dt) {   

         

     for (i=0; __builtin_expect(i<ctx.particles,1); i+=PARTICLES_PER_BLOCK) {

             

         for (j=0; __builtin_expect(j<cnt,1); j++) {

                   

 

 

 

5. Euler example that uses multiple SPEs

 

When coming to parallelize scalar code we always search first for the outer loop to be parallelized. This is in contrast to the SIMDization process in which we always try to vectorize the operations of the most inner loop.

In our example the nested loops running on a single SPE are easily parallelized such that each chunk runs on a different SPE, as follows.

 

Since we will use the same SPE code from the previous version, only the PPE code changes as follows:

 

Example: PPE code for multiple SPEs:
#define END_OF_TIME 10

#define PARTICLES 100000

 

typedef struct {    /* 4-D vector definition. */

  float x, y, z, w;

} vec4D;

 

#define MAX_SPE_THREADS               16

 

vec4D pos[PARTICLES] __attribute__ ((aligned (16)));

vec4D vel[PARTICLES] __attribute__ ((aligned (16)));

vec4D force __attribute__ ((aligned (16)));

float inv_mass[PARTICLES] __attribute__ ((aligned (16)));

float dt = 1.0f;

 

extern spe_program_handle_t particle;

 

typedef struct ppu_pthread_data {

  spe_context_ptr_t spe_ctx;

  pthread_t pthread;

  void *argp;

} ppu_pthread_data_t;

 

void *ppu_pthread_function(void *arg) {

  ppu_pthread_data_t *datap = (ppu_pthread_data_t *)arg;

  unsigned int entry = SPE_DEFAULT_ENTRY;

  if (spe_context_run(datap->spe_ctx, &entry, 0, datap->argp, NULL, NULL) < 0) {

    perror ("Failed running context");

    exit (1);

  }

  pthread_exit(NULL);

}

 

 

int main()

{

  int i, offset, count, spe_threads;

  ppu_pthread_data_t datas[MAX_SPE_THREADS];

  parm_context ctxs[MAX_SPE_THREADS] __attribute__ ((aligned (16)));

 

  /* Determine the number of SPE threads to create.

   */

  spe_threads = spe_cpu_info_get(SPE_COUNT_USABLE_SPES, -1);

  if (spe_threads > MAX_SPE_THREADS) spe_threads = MAX_SPE_THREADS;

 

  /* Create multiple SPE threads */

  for (i=0, offset=0; i<spe_threads; i++, offset+=count) {

    /* Construct a parameter context for each SPE. Make sure

     * that each SPEs (excluding the last) particle count is a multiple

     * of 4 so that inv_mass context pointer is always quadword aligned.

     */

    count = (PARTICLES / spe_threads + 3) & ~3;

    ctxs[i].particles = (i==(spe_threads-1)) ? PARTICLES - offset : count;

    ctxs[i].pos_v = (vector float *)&pos[offset];

    ctxs[i].vel_v = (vector float *)&vel[offset];

    ctxs[i].force_v = *((vector float *)&force);

    ctxs[i].inv_mass = &inv_mass[offset];

    ctxs[i].dt = dt;

   

    /* Create SPE context */

    if ((datas[i].spe_ctx = spe_context_create (0, NULL)) == NULL) {

        perror ("Failed creating context");

        exit (1);

    }

    /* Load SPE program into the SPE context */

    if (spe_program_load (datas[i].spe_ctx, &particle)) {

      perror ("Failed loading program");

      exit (1);

    }

    /* Initialize context run data */

    datas[i].argp = &ctxs[i];

    /* Create pthread for each of the SPE conexts */

    if (pthread_create (&datas[i].pthread, NULL, &ppu_pthread_function, &datas[i])) {

      perror ("Failed creating thread");

    }

  }

 

  /* Wait for all the SPE threads to complete.*/

  for (i=0; i<spe_threads; i++) {

    if (pthread_join (datas[i].pthread, NULL)) {

      perror ("Failed joining thread");

      exit (1);

    }

 

    /* Destroy context */

    if (spe_context_destroy (datas[i].spe_ctx) != 0) {

      perror("Failed destroying context");

      exit (1);

    }

  }

 

  return (0);

}

 

 

Compiling the multiple SPEs euler example:

To compile the above Euler example for the multiple SPEs case, use the same recommended Makefiles that were used for the previous Euler example for the single SPE case, after modifying the file name of the PPE part in the PROGRAM_ppu variable, accordingly:

 

Example: PPE makefile for multiple SPEs:

########################################################################

#                       Subdirectories

########################################################################

DIRS           := spu

 

########################################################################

#                       Target

########################################################################

PROGRAM_ppu    := euler_multi_spe

 

########################################################################

#                       Local Defines

########################################################################

PPU_COMPILER    = xlc

 

IMPORTS         = spu/lib_particle_spu.a -lspe2 -lpthread

 

CC_OPT_LEVEL    := -O5

 

CFLAGS          = -g

LDFLAGS         = -g -Wl,-q

 

########################################################################

#                       buildutils/make.footer

########################################################################

ifdef CELL_TOP

        include $(CELL_TOP)/buildutils/make.footer

else

        include ../../../../buildutils/make.footer

endif

 


6. Tuning the Euler example that uses multiple SPEs

 

Since we already tuned the SPE code part separately, our focus now is to make sure that the program is properly scaled and free of synchronization effects that can easily eat away any potential performance gain from the parallelized program.

A useful way to visualize a running program on Cell with multiple SPEs, is to collect traces by the PDT tool and then display them using the Trace Analyzer plugin of VPA.

 

Compiling the multiple SPEs euler example to collect trace information:

 

In order to collect PDT traces there is a need to modify the Makefile for the SPE code as follows, whereas the PPE Makefile is left unchanged:

 

Example: SPE makefile to collect trace information:

########################################################################

#                      Target

########################################################################

PROGRAM_spu    := particle

LIBRARY_embed  := lib_particle_spu.a

 

 

########################################################################

#                      Local Defines

########################################################################

SPU_COMPILER    := xlc

 

INCLUDE         := -I ..

 

CC_OPT_LEVEL    := -O5

########################################################################

#                       Target

########################################################################

 

PROGRAM_spu     := particle

LIBRARY_embed   := lib_particle_spu.a

 

 

########################################################################

#                       Local Defines

########################################################################

SPU_COMPILER    := xlc

 

INCLUDE         := -I ..

 

CC_OPT_LEVEL    := -O5

 

CPPFLAGS_xlc    := -I/usr/spu/include/trace -Dmain=_pdt_main -Dexit=_pdt_exit -DMFCIO_TRACE -DLIBSYNC_TRACE

LDFLAGS_xlc     := -L/usr/spu/lib/trace -ltrace

 

CFLAGS          = -g

LDFLAGS         = -g -Wl,-q

 

 

########################################################################

#                       buildutils/make.footer

########################################################################

ifdef CELL_TOP

        include $(CELL_TOP)/buildutils/make.footer

else

        include ../../../../../buildutils/make.footer

endif

 

Collecting trace information for the multiple SPEs:

 

Before running the PDT compiled Euler program, the following environment variables need to be exported:

$ export LD_LIBRARY_PATH=/usr/lib/trace

$ export PDT_KERNEL_MODULE=/usr/lib/modules/pdt.ko

$ export PDT_CONFIG_FILE=/usr/share/pdt/config/pdt_cbe_configuration.xml

 

Analyzing trace information:

 

It is interesting to look at the traced SPE events collected for STEP3_multi_spe version using the Trace Analyzer plugin of VPA.

The following screen shots show the trace as collected for the PPE and each SPE, for the Euler program. The green color on the SPEs rows represents regular computation with no specific calls to SPE API macros. The blue color represents MFCIO events such as SPE_MFC_GET, SPE_MFC_GETB and  SPE_MFC_PUT. No color (white color) represents non-computation mode, while all other colors represent synchronization events resulting from the SPE API macros.

 

Viewing trace data for the PPE and each SPE:

SPEs 0 thru 7:


SPEs 7 thru 15:

 


When zooming into the blue areas, as shown in the following screen shot, we immediately see that the dominant delaying events are the SPE_MFC_READ_TAG_STATUS events which take up most of the beginning of each of the SPE execution time. This event means that the threads are waiting for the DMA read to complete.  One of the reasons appear to be the synchronization created between the SPE_MFC_GET and the SPE_MFC_PUT events on each SPE (shown in purple and yellow colors) which cause a delay in the DMA read completion for the proceeding SPEs.

 


 

7. The Tuned Euler example using multiple SPE units

 

In order to resolve the synchronization bottleneck shown above and make the program scalable for the multiple running SPEs, it is clear that we need to focus on parallelizing the input and output DMA transfers, and as a result shorten the delays caused by the DMA read transfers.

One way to resolve this is to interleave DMA transfers with computation by multibuffering the inputs and outputs to eliminate (or reduce) DMA stalls. These stalls are not reflected in the static and dynamic analyses. In the process of adding double buffering, the inner loop is moved into a function, so that the code need not be repeated.

Following is the modified code for the multiple SPE Euler example.

 

The PPE code part for the tuned multi SPE Euler example remains unchanged from the previous step where multiple SPE code context are applied for all available SPEs.

 


Example: SPE code for the tuned multiple SPEs:

#define END_OF_TIME 10

#define PARTICLES 100000

 

/* 4-D vector definition.

 */

typedef struct {

  float x, y, z, w;

} vec4D;

 

 

#define PARTICLES_PER_BLOCK     1024    /* # of particles to process at a time *

 

// Local store structures and buffers.

volatile parm_context ctx;

volatile vector float pos[2][PARTICLES_PER_BLOCK];

volatile vector float vel[2][PARTICLES_PER_BLOCK];

volatile vector float inv_mass[2][PARTICLES_PER_BLOCK/4];

 

 

void process_buffer(int buffer, int cnt, vector float dt_v)

{

  int j;

  vector float dt_inv_mass_v;

 

 

  for (j=0; j<cnt; j++) {

    pos[buffer][j] = spu_madd(vel[buffer][j], dt_v, pos[buffer][j]);

    dt_inv_mass_v = spu_mul(dt_v, spu_splats(inv_mass[buffer][j]));

    vel[buffer][j] = spu_madd(dt_inv_mass_v, ctx.force_v, vel[buffer][j]);

  }

}

 

 

int main(unsigned long long spu_id __attribute__ ((unused)), unsigned long long argv)

{

  int buffer, next_buffer;

  int cnt, next_cnt, left;

  float time, dt;

  vector float dt_v;

  volatile vector float *ctx_pos_v, *ctx_vel_v;

  volatile vector float *next_ctx_pos_v, *next_ctx_vel_v;

  volatile float *ctx_inv_mass, *next_ctx_inv_mass;

  unsigned int tags[2];

 

  // Reserve a pair of DMA tag IDs

  tags[0] = mfc_tag_reserve();

  tags[1] = mfc_tag_reserve();

 

  // Input parameter argv is a pointer to the particle context.

  // Fetch the parameter context, waiting for it to complete.

   mfc_write_tag_mask(1 << tags[0]);

  mfc_get((void *)(&ctx), (unsigned int)argv, sizeof(parm_context), tags[0], 0, 0);

   mfc_read_tag_status_all();

 

  dt = ctx.dt;

  dt_v = spu_splats(dt);

 

  // For each step in time

  for (time=0; time<END_OF_TIME; time += dt) {

    // For each double buffered block of particles

    left = ctx.particles;

 

    cnt = (left < PARTICLES_PER_BLOCK) ? left : PARTICLES_PER_BLOCK;

 

    ctx_pos_v = ctx.pos_v;

    ctx_vel_v = ctx.vel_v;

    ctx_inv_mass = ctx.inv_mass;

 

    // Prefetch first buffer of input data.

    buffer = 0;

    mfc_getb((void *)(pos), (unsigned int)(ctx_pos_v), cnt * sizeof(vector float), tags[0], 0,0);

    mfc_get((void *)(vel), (unsigned int)(ctx_vel_v), cnt * sizeof(vector float), tags[0], 0,0);

    mfc_get((void *)(inv_mass), (unsigned int)(ctx_inv_mass), cnt * sizeof(float), tags[0], 0,0);

 

    while (cnt < left) {

      left -= cnt;

 

      next_ctx_pos_v = ctx_pos_v + cnt;

      next_ctx_vel_v = ctx_vel_v + cnt;

      next_ctx_inv_mass = ctx_inv_mass + cnt;

      next_cnt = (left < PARTICLES_PER_BLOCK) ? left : PARTICLES_PER_BLOCK;

 

      // Prefetch next buffer so the data is available for computation on next loop iteration.

      // The first DMA is barriered so that we don't GET data before the previous iteration's

      // data is PUT.

      next_buffer = buffer^1;

 

      mfc_getb((void *)(&pos[next_buffer][0]), (unsigned int)(next_ctx_pos_v), next_cnt * sizeof(vector float), tags[next_buffer], 0,0);

      mfc_get((void *)(&vel[next_buffer][0]), (unsigned int)(next_ctx_vel_v), next_cnt * sizeof(vector float), tags[next_buffer], 0,0);

      mfc_get((void *)(&inv_mass[next_buffer][0]), (unsigned int)(next_ctx_inv_mass), next_cnt * sizeof(float), tags[next_buffer], 0,0);

     

      // Wait for previously prefetched data

      mfc_write_tag_mask(1 << tags[buffer]);

      mfc_read_tag_status_all();

 

      process_buffer(buffer, cnt, dt_v);

 

      // Put the buffer's position and velocity data back into system memory

      mfc_put((void *)(&pos[buffer][0]), (unsigned int)(ctx_pos_v), cnt * sizeof(vector float), tags[buffer], 0, 0);

      mfc_put((void *)(&vel[buffer][0]), (unsigned int)(ctx_vel_v), cnt * sizeof(vector float), tags[buffer], 0, 0);

     

      ctx_pos_v = next_ctx_pos_v;

      ctx_vel_v = next_ctx_vel_v;

      ctx_inv_mass = next_ctx_inv_mass;

 

      buffer = next_buffer;

      cnt = next_cnt;          

    }

 

    // Wait for previously prefetched data

    mfc_write_tag_mask(1 << tags[buffer]);

    mfc_read_tag_status_all();

 

    process_buffer(buffer, cnt, dt_v);

 

    // Put the buffer's position and velocity data back into system memory

    mfc_put((void *)(&pos[buffer][0]), (unsigned int)(ctx_pos_v), cnt * sizeof(vector float), tags[buffer], 0, 0);

    mfc_put((void *)(&vel[buffer][0]), (unsigned int)(ctx_vel_v), cnt * sizeof(vector float), tags[buffer], 0, 0);

 

 

    // Wait for DMAs to complete before starting the next step in time.

     mfc_write_tag_mask(1 << tags[buffer]);

    mfc_read_tag_status_all();

  }

 

  return (0);

}

 

Viewing trace data for the tuned multiple SPE:

The following collected PDT traces after compiling the code with xlC –O5 level, show a clear improvement from more than ~5 million cycles to less than ~2.5 million, as shown below:

 

PPE and SPEs 0 thru 7:


SPEs 7 thru 15:

 

However, in order to further improve the performance we can replace the process_buffer function by an improved version which includes a loop unrolling by:

·        Unrolling the loop to provide additional instructions for interleaving.

·        Loading the DMA-buffer contents into local nonvolatile registers to eliminate volatile migration constraints..

·       Eliminating the scalar load of the inv_mass variable.

·       Eliminating extra multiplies of dt*inv_mass and splat the products after the SIMD multiply, instead of before the multiply.


Example: Further tuned SPE code for the process_buffer function:

 void process_buffer(int buffer, int cnt, vector float dt_v)

{

  int i;

  volatile vector float *p_inv_mass_v;

  vector float force_v, inv_mass_v;

  vector float pos0, pos1, pos2, pos3;

  vector float vel0, vel1, vel2, vel3;

  vector float dt_inv_mass_v, dt_inv_mass_v_0, dt_inv_mass_v_1, dt_inv_mass_v_2, dt_inv_mass_v_3;

  vector unsigned char splat_word_0 = (vector unsigned char){0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3};

  vector unsigned char splat_word_1 = (vector unsigned char){4, 5, 6, 7, 4, 5, 6, 7, 4, 5, 6, 7, 4, 5, 6, 7};

  vector unsigned char splat_word_2 = (vector unsigned char){8, 9,10,11, 8, 9,10,11, 8, 9,10,11, 8, 9,10,11};

  vector unsigned char splat_word_3 = (vector unsigned char){12,13,14,15,12,13,14,15,12,13,14,15,12,13,14,15};

 

  p_inv_mass_v = (volatile vector float *)&inv_mass[buffer][0];

  force_v = ctx.force_v;

 

  // Compute the step in time for the block of particles, four

  // particle at a time.

  for (i=0; i<cnt; i+=4) {

    inv_mass_v = *p_inv_mass_v++;

   

    pos0 = pos[buffer][i+0];

    pos1 = pos[buffer][i+1];

    pos2 = pos[buffer][i+2];

    pos3 = pos[buffer][i+3];

 

    vel0 = vel[buffer][i+0];

    vel1 = vel[buffer][i+1];

    vel2 = vel[buffer][i+2];

    vel3 = vel[buffer][i+3];

 

    dt_inv_mass_v = spu_mul(dt_v, inv_mass_v);

 

    pos0 = spu_madd(vel0, dt_v, pos0);

    pos1 = spu_madd(vel1, dt_v, pos1);

    pos2 = spu_madd(vel2, dt_v, pos2);

    pos3 = spu_madd(vel3, dt_v, pos3);

 

    dt_inv_mass_v_0 = spu_shuffle(dt_inv_mass_v, dt_inv_mass_v, splat_word_0);

    dt_inv_mass_v_1 = spu_shuffle(dt_inv_mass_v, dt_inv_mass_v, splat_word_1);

    dt_inv_mass_v_2 = spu_shuffle(dt_inv_mass_v, dt_inv_mass_v, splat_word_2);

    dt_inv_mass_v_3 = spu_shuffle(dt_inv_mass_v, dt_inv_mass_v, splat_word_3);

 

    vel0 = spu_madd(dt_inv_mass_v_0, force_v, vel0);

    vel1 = spu_madd(dt_inv_mass_v_1, force_v, vel1);

    vel2 = spu_madd(dt_inv_mass_v_2, force_v, vel2);

    vel3 = spu_madd(dt_inv_mass_v_3, force_v, vel3);

 

    pos[buffer][i+0] = pos0;

    pos[buffer][i+1] = pos1;

    pos[buffer][i+2] = pos2;

    pos[buffer][i+3] = pos3;

 

    vel[buffer][i+0] = vel0;

    vel[buffer][i+1] = vel1;

    vel[buffer][i+2] = vel2;

    vel[buffer][i+3] = vel3;

  }

}

 

Viewing trace data for the PPE and each SPE of the final tuned Euler example:

Resulted traces show another significant improvement in performance and a total of 1.6 Miliion cycles

PPE and SPEs 0 thru 7:

 

SPEs 7 thru 15:


References:

1.                          introductory overview of the Cell multiprocessor IBM JRD 49-4/5 | Introduction to the Cell multiprocessor

2.                          IBM Systems Journal issue 45-1, Online Game Technology IBM SJ 45-1 | Using advanced compiler technology

3.                              Introduction to the Cell Broadband Engine Cell Broadband Engine - IBM Microelectronics

4.                           Cell Broadband Engine™ processor – based systems White Paper

5.                           Spufs: The Cell Synergistic Processing Unit as a virtual file system

6.                           ”Maximizing the power of the Cell Broadband Engine processor: 25 tips to optimal application performance", by Dan Brokenshire

7.                           FFTC : Fastest Fourier Transform for the IBM Cell Broadband Engine David A. Bader and Virat Agarwal College of Computing Georgia Institute of Technology, 
Atlanta, GA, USA 30332

8.                          The Potential of the Cell Processor for Scientific Computing, by Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, Katherine Yelick, Computational Research Division,Lawrence Berkeley National Laboratory, Berkeley, CA 94720

9.                           FFTC : Fastest Fourier Transform for the IBM Cell Broadband Engine
David A. Bader and Virat Agarwal College of Computing Georgia Institute of Technology,  Atlanta, GA, USA 30332

10.                            White Paper - A Programming Example: Large Fast Fourier Transform ...

11.                        IBM BladeCenter QS20 Performance Potentials

12.                       ”Maximizing the power of the Cell Broadband Engine processor: 25 tips to optimal application performance", by Dan Brokenshire

13.                        Cell BE programming Tutorial http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/FC857AE550F7EB83872571A80061F788?Open&S_TACT=105AGX16&S_CMP=LP

14.                           Thomas Chen, et al., “Cell Broadband Engine Architecture and its First Implementation – a Performance View”, http://www.ibm.com/developerworks/power/library/pa-cellperf/

15.                        Samuel Williams, et al., “The Potential of the Cell Processor for Scientific Computing”,
http://www.cs.berkeley.edu/~samw/projects/cell/CF06.pdf#search=%22Cell%20performance%20%20FFT%22

16.                        Carsten Benthin, et al., “Ray Tracing on the Cell Processor”, http://graphics.cs.unisb.de/~benthin/cellrt06.pdf#search=%22InTrace%20Ray%20Tracing%20Performance%20Opteron%22 bltiQS20perfwp103106.doc Page 5

17.                        Alex Chunghen Chow, et al., “A Programming Example: Large FFT on the Cell Broadband Engine”,
http://www.ibm.com/chips/techlib/techlib.nsf/techdocs/0AA2394A505EF0FB872570AB005BF0F1/$file/GSPx_FFT_paper_legal_0115.pdf#search=%22Cell%20performance%20%20FFT%22

18.                        B. Minor, G. Fossum, V. To, “Terrain Rendering Engine (TRE): Cell Broadband Optimized Real-Time Ray-Caster”, Proceedings of GPSx conference, October 2005.