Table of content:
1. Euler scalar example
·
Example: Euler scalar program
·
Compiling the Euler scalar example
·
Example: Euler scalar program
makefile
·
Using
Oprofile to collect sampling data
·
Analyzing sampling data
·
Example: Euler program sampling data
·
Example: Euler program sampling data at the
instruction-level
2. Euler
example that uses Array of Structure (AOS) to apply SIMD
directives
·
Example: Euler code that uses AOS to apply SIMD
directives
·
Compiling the AOS code to apply the SIMD
directives
·
Example: AOS makefile
·
Analyzing sampling data
·
Example: AOS sampling data
·
Improving performance with higher optimization level of
the compiler
o
Example: AOS makefile using gcc -03
o
Example: AOS gcc -03 sampling data
o
Example: AOS makefile using xlc -05
o
Example: AOS xlc -05 sampling data
3. Euler
example that uses Structure of Arrays (SOA) to apply SIMD
directives
·
Example: Euler code that uses SOA to apply SIMD
directives
·
Compiling the SOA code to apply the SIMD
directives
·
Example: SOA makefile
·
Analyzing sampling data
·
Example: SOA sampling data
4. Euler
example that uses single SPE unit
·
Example: PPE code for the single SPE
unit
·
Example: SPE code for the single SPE
unit
·
Compiling the single SPE Euler
example
o
Example: SPE makefile using gcc
o
Example: PPE makefile using gcc
·
Analyzing sampling data
·
Example: SPE sampling data
·
Improving the efficiency of SPU
macros
·
Example: SPE code for the single SPE unit with more
efficient SPE macros
·
Example: SPE sampling data with more efficient SPE
macros
·
Compiling the single SPE Euler that uses the more
efficient SPE macros
o
Example: SPE makefile using xlc -05
o
Example: PPE makefile using xlc -05
o
Example: xlc -05 sampling data
·
Collecting frequency profile using
FDPR-Pro
·
Example: Frequency profiling data
·
Adding branch hint directives
5. Euler
example that uses multiple SPEs
·
Example: PPE code for multiple SPEs
·
Compiling the multiple SPEs Euler
example
·
Example: PPE makefile for multiple
SPEs
6. Tuning
the Euler example that uses multiple SPEs
·
Compiling the multiple SPEs Euler example to collect
trace information
·
Example: SPE makefile to collect trace
information
·
Collecting trace information for the multiple
SPEs
·
Viewing trace data for the PPE and each
SPE
7. The
tuned Euler example using multiple SPE units
·
Example: SPE code for the tuned multiple
SPEs
·
Compiling the multiple SPEs Euler
example
·
Viewing trace data for the tuned multiple
SPE
·
Example: Further tuned SPE code for the process_buffer
function
·
Viewing trace data of the final tuned Euler
example
Programming Examples
Euler Programming Examples
In this programming example we demonstrate the usage of the visualized performance analysis together with basic programming optimizations and compiler optimization options to provide a simple methodology for porting and tuning Cell programs.
We start with a simple scalar program example which can also be found in the IBM Cell programming tutorial [13] and show all the recommended tuning steps using visualized performance analysis tools, until reaching a fully optimized, tuned and scaled Cell program version.
The paper is organized as follows: section 1 describes a simple scalar example of the Euler-based particle-system simulation. Sections 2 and 3 describe two known approaches to SIMDizing the program. Section 4 shows how to enable the example to run on a single SPE, followed by a version that uses multiple SPEs in section 5. Finally, section 6 describes how to tune and scale the multiple SPE Cell program to run optimally and without delays.
1. Euler Scalar
example
Throughout this dcoument we will use a simple Euler-based particle-system simulation.
The example uses numerical integration techniques using Euler's method of integration by computing the next value of a function of time F(t), using the Euler step formula:
F(t+dt) = F(t) + F'(t) * dt;
The particle system in this example consists of:
· Array of 3-D positions for each particle (pos[]) expressed by 4-D homogenous coordinates (x,y,z,1)
· Array of 3-D velocities for each particle (vel[]) expressed as 4-D homogenous coordinates (x,y,z,1)
· Array of masses for each particle (mass[]) expressed as 4-D homogenous coordinates (x,y,z,1)
· Force vector that changes over time (force)
The simple scalar
code version of the program is given here along with compilation
instructions:
Euler scalar
program:
#define END_OF_TIME
10
#define PARTICLES
100000
/* 4-D vector
definition.
*/
typedef struct {
float x, y, z, w;
} vec4D;
vec4D pos[PARTICLES];
// particle
positions
vec4D vel[PARTICLES];
// particle velocities
vec4D force;
// current force being applied to the particles
float inv_mass[PARTICLES]; // inverse mass
of the particles
float dt = 1.0f;
// step in time
int main()
{
int i;
float time;
float dt_inv_mass;
// For each step in time
for (time=0; time<END_OF_TIME; time += dt) {
//
For each particle
for (i=0; i<PARTICLES; i++) {
// Compute the new position and velocity as acted upon by the force
f.
pos[i].x = vel[i].x * dt + pos[i].x;
pos[i].y = vel[i].y * dt + pos[i].y;
pos[i].z = vel[i].z * dt + pos[i].z;
dt_inv_mass = dt * inv_mass[i];
vel[i].x = dt_inv_mass * force.x +
vel[i].x;
vel[i].y = dt_inv_mass * force.y +
vel[i].y;
vel[i].z = dt_inv_mass * force.z +
vel[i].z;
}
}
return (0);
}
Compiling
the Euler scalar example:
Providing that the
above code is placed in a file called "euler_scalar.c", then to compile it,
simply use the following command, which includes the options of preserving
relocation and line number information in the generated binary
"euler_scalar":
$ cc –g –Wl,-q –o euler_scalar euler_scalar.c
Or, preferably, use the following Makefile:
Example:
Euler scalar program makefile:
########################################################################
#
Target
########################################################################
PROGRAM_ppu
:= euler_scalar
########################################################################
#
Local Defines
########################################################################
INCLUDE
:= -I ..
CFLAGS
= -g
LDFLAGS
= -g -Wl,-q
########################################################################
#
buildutils/make.footer
########################################################################
ifdef CELL_TOP
include
$(CELL_TOP)/buildutils/make.footer
else
include
../../../../buildutils/make.footer
endif
When compiling the above program to include line number information we can already use the Oprofile and VPA Cell performance tools to visualize the program, even before being SIMdized or parallelized for the Cell BE. This information is going to be useful in the next steps in order to evaluate the overhead which is accompanied by the SIMDization and the parallelization process.
Using
Oprofile to collect sampling data:
As root, type the
following instructions to collect profiling using
Oprofile
# opcontrol --deinit // recommended to clear
any previous Oprofile data and bringing the daemon down
# opcontrol --init
# opcontrol --reset
Signalling daemon...
done
// to collect PPU
average cycle events:
# opcontrol
--separate=all--event=CYCLES:100000 // for SPU cycles, use: --event=SPU_CYCLES:100000
# opcontrol --start
Profiler
running.
#
euler_scalar // Running the Euler
example program
# opcontrol --stop
Stopping
profiling.
# opcontrol --dump
# opreport -X -g -l -d -o
euler.opm // To generate an XML
output report called: euler.opm
Analyzing
sampling data:
After gathering profiling, using the Oprofile tool, about the distribution of the cycles that are spent in the body of the inner loop, we can display them with the Profile Analyzer plugin of the Visual Performance Analyzer (VPA) tool, as follows:
Example:
Euler program sampling data:

Example:
Euler program sampling data at the instruction level:
It may be interesting to see the same cycles distribution at the instruction level:

Surprisingly the
largest number of spent cycles is on the "fadds" instruction that is generated
from the following C statement in the inner loop:
pos[i].z = vel[i].z * dt +
pos[i].z;
2. Euler example that uses Array of
Structure (AOS) to apply SIMD directives
The most common technique for SIMDizing code when input is naturally expressed as a vector is called Array Of Structures (AOS).
The
following example shows how to SIMDize in the AOS form.
Vector/SIMD Multimedia Extension intrinsics are used, and they can be identified
by their prefix
vec_.
The algorithm assumes that the number of particles is a multiple of four.
Special code must be included to handle the last number of particles that is not
a multiple of four.
Example: Euler code that uses AOS to apply SIMD
directives:
#define END_OF_TIME 10
#define PARTICLES 100000
/*
4-D vector definition.
*/
typedef struct {
float x, y, z,
w;
} vec4D;
vec4D pos[PARTICLES] __attribute__
((aligned (16)));
vec4D vel[PARTICLES] __attribute__
((aligned (16)));
vec4D force __attribute__ ((aligned
(16)));
float inv_mass[PARTICLES] __attribute__
((aligned (16)));
float dt __attribute__ ((aligned (16)))
= 1.0f;
int main() {
int i;
float time;
float
dt_inv_mass __attribute__ ((aligned (16)));
vector float
dt_v, dt_inv_mass_v;
vector float
*pos_v, *vel_v, force_v;
vector float
zero = (vector float){0.0f, 0.0f, 0.0f, 0.0f};
pos_v = (vector float *)pos;
vel_v = (vector float *)vel;
force_v = *((vector float *)&force);
// Replicate the variable time step
across elements 0-2 of
// a floating point vector. Force the
last element (3) to zero.
dt_v = vec_sld(vec_splat(vec_lde(0, &dt), 0), zero, 4);
// For each
step in time
for (time=0;
time<END_OF_TIME; time += dt) {
// For each particle
for
(i=0; i<PARTICLES; i++) {
// Compute the
new position and velocity as acted upon by the force f.
pos_v[i] =
vec_madd(vel_v[i], dt_v, pos_v[i]);
dt_inv_mass =
dt * inv_mass[i];
dt_inv_mass_v =
vec_splat(vec_lde(0, &dt_inv_mass), 0);
vel_v[i] =
vec_madd(dt_inv_mass_v, force_v, vel_v[i]);
}
}
return
(0);
}
Compiling the AOS code to apply the SIMD
directives:
Example: AOS
makefile
Provided
that the file name containing the above code is "euler_simd_aos.c", then to
compile it, use the following recommended Makefile:
########################################################################
#
Target
########################################################################
PROGRAM_ppu
:= euler_simd_aos
########################################################################
#
Local Defines
########################################################################
INCLUDE
:= -I ..
CFLAGS
= -g
LDFLAGS
= -g -Wl,-q
########################################################################
#
buildutils/make.footer
########################################################################
ifdef CELL_TOP
include
$(CELL_TOP)/buildutils/make.footer
else
include
../../../../buildutils/make.footer
endif
Analyzing sampling data for AOS
example:
Here
is the collected Oprofile for the AOS SIMDized version:

Although the total of spent cycles
stands on 909, the collected Oprofile
shows that performance bottleneck is now on the statement
pos_v[i] = vec_madd(vel_v[i], dt_v, pos_v[i]);
(59.6%)
Improving
performance with higher optimization level of the
compiler:
One
basic way to improve the performance is to use higher optimization levels of the
compiler.
Here is the modified Makefile using gcc –O3 level:
Example: AOS makefile using
gcc –O3:
########################################################################
#
Target
########################################################################
PROGRAM_ppu
:= euler_simd_aos
########################################################################
#
Local Defines
########################################################################
INCLUDE
:= -I ..
CC_OPT_LEVEL := -O3
CFLAGS
= -g
LDFLAGS
= -g -Wl,-q
########################################################################
#
buildutils/make.footer
########################################################################
ifdef CELL_TOP
include
$(CELL_TOP)/buildutils/make.footer
else
include
../../../../buildutils/make.footer
endif
Example:
AOS gcc -03 sampling data:
After
applying optimization option of gcc –O3 we can see an improvement in the total
of cycles to 721 as shown below:
Example: AOS makefile using xlc
–O5:
A
higher optimization level of –O5 can be applied using the IBM xlc
compiler.
Here is the modified Makefile using xlc –O5 level:
########################################################################
#
Target
########################################################################
PROGRAM_ppu
:= euler_simd_aos
########################################################################
#
Local Defines
########################################################################
PPU_COMPILER = xlc
INCLUDE
:= -I ..
CC_OPT_LEVEL := -O5
CFLAGS
= -g
LDFLAGS
= -g -Wl,-q
########################################################################
#
buildutils/make.footer
########################################################################
ifdef CELL_TOP
include
$(CELL_TOP)/buildutils/make.footer
else
include
../../../../buildutils/make.footer
endif
Example:
AOS xlc -05 sampling data:
An
even lower number of 531 cycles with no isolated bottlenecks can be obtained
using the xlC –O5 compiler option.
Here
is a screen shot of the generated profiled code:

3. Euler example that uses Structure of
Arrays (SOA) to apply SIMD directives
Another form of
SIMDization is called Structure-of-Arrays (SOA). In this form, you
collect the individual elements of the natural vectors into separate arrays
(also called parallel-array form). The code is then written as if it were to execute scalar instructions, but it will be executing
SIMD instructions. This results in code that computes four single-precision
floats results simultaneously.
The
following example shows how to SIMDize the Euler code example
in the SOA form. As in Step 1a, the algorithm assumes that the number of
particles is a multiple of 4.
Example: Euler code that uses SOA to apply SIMD
directives:
#define END_OF_TIME 10
#define PARTICLES 100000
typedef struct {
float x, y, z,
w;
} vec4D;
// Separate
arrays for each component of the vector.
vector float pos_x[PARTICLES/4],
pos_y[PARTICLES/4], pos_z[PARTICLES/4];
vector float vel_x[PARTICLES/4],
vel_y[PARTICLES/4], vel_z[PARTICLES/4];
vec4D force __attribute__ ((aligned
(16)));
float inv_mass[PARTICLES] __attribute__
((aligned (16)));
float dt = 1.0f;
int main() {
int i;
float time;
float
dt_inv_mass __attribute__ ((aligned (16)));
vector float
force_v, force_x, force_y, force_z;
vector float
dt_v, dt_inv_mass_v;
// Create a
replicated vector for each component of the force vector.
force_v = *(vector float *)(&force);
force_x = vec_splat(force_v, 0);
force_y = vec_splat(force_v, 1);
force_z = vec_splat(force_v, 2);
// Replicate the variable time step
across all elements.
dt_v = vec_splat(vec_lde(0, &dt), 0);
// For each
step in time
for (time=0;
time<END_OF_TIME; time += dt) {
// For each particle
for
(i=0; i<PARTICLES/4; i++) {
// Compute the
new position and velocity as acted upon by the force f.
pos_x[i] =
vec_madd(vel_x[i], dt_v, pos_x[i]);
pos_y[i] =
vec_madd(vel_y[i], dt_v, pos_y[i]);
pos_z[i] =
vec_madd(vel_z[i], dt_v, pos_z[i]);
dt_inv_mass =
dt * inv_mass[i];
dt_inv_mass_v =
vec_splat(vec_lde(0, &dt_inv_mass), 0);
vel_x[i] =
vec_madd(dt_inv_mass_v, force_x, vel_x[i]);
vel_y[i] =
vec_madd(dt_inv_mass_v, force_y, vel_y[i]);
vel_z[i] =
vec_madd(dt_inv_mass_v, force_z, vel_z[i]);
}
}
return (0);
}
Compiling the SOA code to apply the SIMD
directives:
Example: SOA
makefile:
To
compile the above SOA SIMDized version of the Euler example, use the same
Makefile that was used for the AOS case, and simply set
the PROGRAM_ppu variable to the name of the C file containing the above
code, as follows:
########################################################################
#
Target
########################################################################
PROGRAM_ppu
:= euler_simd_soa
########################################################################
#
Local Defines
########################################################################
INCLUDE
:= -I ..
CFLAGS
= -g
LDFLAGS
= -g -Wl,-q
########################################################################
#
buildutils/make.footer
########################################################################
ifdef CELL_TOP
include
$(CELL_TOP)/buildutils/make.footer
else
include
../../../../buildutils/make.footer
endif
Analyzing sampling data:
Example: SOA sampling
data:
Interestingly, the following screen shot of the Oprofile collected for the SIMdized SOA case shows similar bottlenecks on the statement of:
pos_x[i] = vec_madd(vel_x[i], dt_v, pos_x[i]); (59.37%)

Furthermore, the same tuning
process using compiler optimizations of gcc –O3 and xlC –O5 provides
similar results to the SOA SIMDization case.
4. Euler example that uses single SPE unit
A simple code example of the Euler program running on
the Cell and using a single SPE processor is to simply move the entire loops
computation to the SPE part.
The generated code for this simple case would
look as follows:
Example: PPE code for the single SPE
unit:
#define END_OF_TIME
10
#define PARTICLES
100000
typedef struct { /* 4-D vector
definition. */
float x, y, z,
w;
}
vec4D;
typedef struct {
int particles; // number
of particle to process
vector float *pos_v; //
pointer to array of positions vectors
vector float
*vel_v; // pointer to array of velocity
vectors
float
*inv_mass; //
pointer to array of mass vectors
vector float force_v; // force
vector
float dt;
// current step in time
} parm_context;
vec4D pos[PARTICLES] __attribute__
((aligned (16)));
vec4D vel[PARTICLES] __attribute__
((aligned (16)));
vec4D force __attribute__ ((aligned
(16)));
float inv_mass[PARTICLES] __attribute__
((aligned (16)));
float dt = 1.0f;
extern spe_program_handle_t
particle;
typedef struct ppu_pthread_data
{
spe_context_ptr_t
spe_ctx;
pthread_t
pthread;
void
*argp;
}
ppu_pthread_data_t;
void *ppu_pthread_function(void *arg)
{
ppu_pthread_data_t *datap =
(ppu_pthread_data_t *)arg;
unsigned int
entry = SPE_DEFAULT_ENTRY;
if
(spe_context_run(datap->spe_ctx, &entry, 0, datap->argp, NULL, NULL)
< 0) {
perror ("Failed running context");
exit (1);
}
pthread_exit(NULL);
}
int main()
{
ppu_pthread_data_t data;
parm_context
ctx __attribute__ ((aligned (16)));
ctx.particles =
PARTICLES;
ctx.pos_v = (vector float *)pos;
ctx.vel_v = (vector float *)vel;
ctx.force_v = *((vector float *)&force);
ctx.inv_mass =
inv_mass;
ctx.dt = dt;
/* Create a SPE
context */
if
((data.spe_ctx = spe_context_create (0, NULL)) == NULL) {
perror ("Failed creating context");
exit (1);
}
/* Load SPE program into the SPE
context*/
if
(spe_program_load (data.spe_ctx, &particle)) {
perror ("Failed loading program");
exit (1);
}
/* Initialize context run data
*/
data.argp =
&ctx;
/* Create
pthread for each of the SPE contexts */
if
(pthread_create (&data.pthread, NULL, &ppu_pthread_function, &data))
{
perror ("Failed creating thread");
exit (1);
}
/* Wait for the threads to complete
*/
if
(pthread_join (data.pthread, NULL)) {
perror ("Failed joining thread\n");
exit (1);
}
/* Destroy SPE
context */
if
(spe_context_destroy (data.spe_ctx) != 0) {
perror("Failed destroying context");
exit (1);
}
return
(0);
}
The following SPE code runs the loops body on the SPE processor.
Example: SPE code for the single SPE
unit:
#define END_OF_TIME
10
#define PARTICLES
100000
typedef struct { /* 4-D vector definition.
*/
float x, y, z,
w;
}
vec4D;
typedef struct {
int particles; // number
of particle to process
vector float *pos_v; //
pointer to array of positions vectors
vector float
*vel_v; // pointer to array of velocity
vectors
float
*inv_mass; //
pointer to array of mass vectors
vector float force_v; // force
vector
float dt;
//
current step in time
} parm_context;
#define PARTICLES_PER_BLOCK 1024 /* # of particles to process
at a time */
// Local store structures and buffers.
volatile parm_context
ctx;
volatile vector float
pos[PARTICLES_PER_BLOCK];
volatile vector float
vel[PARTICLES_PER_BLOCK];
volatile float
inv_mass[PARTICLES_PER_BLOCK];
int main(unsigned long long
spu_id __attribute__ ((unused)), unsigned long long parm)
{
int i,
j;
int left,
cnt;
float
time;
unsigned int
tag_id;
vector float
dt_v, dt_inv_mass_v;
// Reserve a tag
ID
tag_id = mfc_tag_reserve();
spu_writech(MFC_WrTagMask, -1);
// Input parameter parm is a pointer to
the particle parameter context.
// Fetch the context, waiting for it to
complete.
spu_mfcdma32((void *)(&ctx), (unsigned int)parm,
sizeof(parm_context), tag_id, MFC_GET_CMD);
(void)spu_mfcstat(MFC_TAG_UPDATE_ALL);
dt_v = spu_splats(ctx.dt);
// For each
step in time
for (time=0;
time<END_OF_TIME; time += ctx.dt) {
// For each block of particles
for
(i=0; i<ctx.particles; i+=PARTICLES_PER_BLOCK) {
// Determine the number of particles in this
block.
left = ctx.particles - i;
cnt = (left < PARTICLES_PER_BLOCK) ? left : PARTICLES_PER_BLOCK;
// Fetch the
data - position, velocity and inverse_mass. Wait for the DMA to complete
// before
performing computation.
spu_mfcdma32((void *)(pos), (unsigned int)(ctx.pos_v+i),
cnt * sizeof(vector float), tag_id,
MFC_GETB_CMD);
spu_mfcdma32((void *)(vel), (unsigned int)(ctx.vel_v+i),
cnt * sizeof(vector float), tag_id,
MFC_GET_CMD);
spu_mfcdma32((void *)(inv_mass), (unsigned int)(ctx.inv_mass+i),
cnt * sizeof(float), tag_id,
MFC_GET_CMD);
(void)spu_mfcstat(MFC_TAG_UPDATE_ALL);
// Compute the step in time for the block
of particles
for (j=0; j<cnt; j++) {
pos[j] = spu_madd(vel[j], dt_v,
pos[j]);
dt_inv_mass_v = spu_mul(dt_v,
spu_splats(inv_mass[j]));
vel[j] = spu_madd(dt_inv_mass_v, ctx.force_v,
vel[j]);
}
// Put the
position and velocity data back into system memory
spu_mfcdma32
((void *)(pos), (unsigned int)(ctx.pos_v+i),
cnt * sizeof(vector float), tag_id,
MFC_PUT_CMD);
spu_mfcdma32((void *)(vel), (unsigned int)(ctx.vel_v+i),
cnt * sizeof(vector float), tag_id,
MFC_PUT_CMD);
}
}
// Wait for final DMAs to complete
before terminating SPU thread.
(void)spu_mfcstat(MFC_TAG_UPDATE_ALL);
return
(0);
}
Compiling the single SPE euler example:
Example: SPE makefile using
gcc:
In order to compile the above code, place the SPE code part file "particle.c" in a subdirectory called "spu" under the directory where the PPE code file "euler_spe.c" is placed.
Here is the Makefile for the SPE code:
########################################################################
#
Target
########################################################################
PROGRAM_spu
:= particle
LIBRARY_embed
:= lib_particle_spu.a
########################################################################
#
Local Defines
########################################################################
INCLUDE := -I ..
CFLAGS
= -g
LDFLAGS
= -g -Wl,-q
########################################################################
#
buildutils/make.footer
########################################################################
ifdef CELL_TOP
include
$(CELL_TOP)/buildutils/make.footer
else
include
../../../../../buildutils/make.footer
endif
Example: PPE makefile using gcc:
Here is the Makefile for the PPE code part in the file "euler_spe.c", located one directory above the "spu" directory:
########################################################################
#
Subdirectories
########################################################################
DIRS
:= spu
########################################################################
#
Target
########################################################################
PROGRAM_ppu :=
euler_spe
########################################################################
#
Local Defines
########################################################################
IMPORTS
= spu/lib_particle_spu.a -lspe2 -lpthread
CFLAGS
= -g
LDFLAGS
= -g -Wl,-q
########################################################################
#
buildutils/make.footer
########################################################################
ifdef CELL_TOP
include
$(CELL_TOP)/buildutils/make.footer
else
include
../../../../buildutils/make.footer
endif
Analyzing sampling
data:
In general, when moving the hot loops to run in the SPE rather than on the PPE, the number of cycles is expected to increase with compare to the same loop when run on the PPE.
Example: SPE
sampling data:
Here is what the Oprofile collects for the SPE code of the above Cell version of the Euler program (using the SPU_CYCLES event of Oprofile):

As can be seen, the total number of cycles increases from 909 to 7680.
It is, therefore, highly important to make sure that the SPU macros that are used, will be efficient.
Improving the effiecency of SPU
macros:
The following code
replaces the non-efficient spu_mfcdma32 macro by the more recent
and more efficient API macros: mfc_get, mfc_getb, mfc_put,
mfc_write_tag_update_all:
Example: SPE code for the single SPE unit with
more efficient SPE macros:
#define END_OF_TIME
10
#define PARTICLES
100000
typedef struct { /* 4-D vector definition.
*/
float x, y, z,
w;
}
vec4D;
typedef struct {
int particles; // number
of particle to process
vector float *pos_v; //
pointer to array of positions vectors
vector float
*vel_v; // pointer to array of velocity
vectors
float
*inv_mass; // pointer
to array of mass vectors
vector float force_v; // force vector
float dt;
// current step in time
} parm_context;
#define PARTICLES_PER_BLOCK 1024 /* # of particles to process
at a time */
// Local store
structures and buffers.
volatile parm_context ctx;
volatile vector float
pos[PARTICLES_PER_BLOCK];
volatile vector float
vel[PARTICLES_PER_BLOCK];
volatile float
inv_mass[PARTICLES_PER_BLOCK];
int main(unsigned long long spu_id __attribute__ ((unused)),
unsigned long long parm)
{
int i, j;
int left, cnt;
float time;
unsigned int tag_id;
vector float dt_v, dt_inv_mass_v;
// Reserve
a tag ID
tag_id =
mfc_tag_reserve();
mfc_write_tag_mask(-1);
// Input
parameter parm is a pointer to the particle parameter
context.
// Fetch
the context, waiting for it to complete.
mfc_get((void *)(&ctx), (unsigned int)parm,
sizeof(parm_context), tag_id, 0, 0);
//
mfc_write_tag_update_all();
mfc_read_tag_status_all();
dt_v =
spu_splats(ctx.dt);
// For each step in time
for (time=0; time<END_OF_TIME; time += ctx.dt)
{
// For each block of
particles
for (i=0; i<ctx.particles;
i+=PARTICLES_PER_BLOCK) {
// Determine the number of particles in this
block.
left = ctx.particles - i;
cnt = (left < PARTICLES_PER_BLOCK) ? left : PARTICLES_PER_BLOCK;
// Fetch the
data - position, velocity and inverse_mass. Wait for the DMA to complete
// before
performing computation.
mfc_getb((void *)(pos), (unsigned int)(ctx.pos_v+i), cnt *
sizeof(vector float),tag_id,0,0);
mfc_get((void *)(vel), (unsigned int)(ctx.vel_v+i), cnt *
sizeof(vector float), tag_id, 0,0);
mfc_get((void *)(inv_mass),(unsigned int)(ctx.inv_mass+i), cnt *
sizeof(float),tag_id,0,0);
mfc_write_tag_update_all();
mfc_read_tag_status_all();
// Compute the
step in time for the block of particles
for (j=0; j<cnt; j++) {
pos[j] = spu_madd(vel[j], dt_v,
pos[j]);
dt_inv_mass_v = spu_mul(dt_v,
spu_splats(inv_mass[j]));
vel[j] = spu_madd(dt_inv_mass_v, ctx.force_v,
vel[j]);
}
// Put the
position and velocity data back into system memory
mfc_put((void *)(pos), (unsigned int)(ctx.pos_v+i), cnt *
sizeof(vector float), tag_id, 0,0);
mfc_put((void *)(vel), (unsigned int)(ctx.vel_v+i), cnt *
sizeof(vector float), tag_id, 0,0);
}
}
// Wait for
final DMAs to complete before terminating SPU thread.
mfc_read_tag_status_all();
return (0);
}
Example: SPE sampling data with more efficient SPE macros:
As can be seen, by the following collected profiling, the total number of cycles is significantly reduced to 1332:

Compiling the single SPE euler that uses more efficient SPE
macros:
As before, we can improve the performance by increasing the optimization level of the compiler.
Here are the recommended Makefiles for the SPE Euler example, when using xlc –O5:
Example:
SPE makefile using xlc -05:
########################################################################
#
Target
########################################################################
PROGRAM_spu
:= particle
LIBRARY_embed
:= lib_particle_spu.a
########################################################################
#
Local Defines
########################################################################
SPU_COMPILER := xlc
INCLUDE
:= -I ..
CC_OPT_LEVEL := -O5
CFLAGS
= -g
LDFLAGS
= -g -Wl,-q
########################################################################
#
buildutils/make.footer
########################################################################
ifdef CELL_TOP
include
$(CELL_TOP)/buildutils/make.footer
else
include
../../../../../buildutils/make.footer
endif
Example:
PPE makefile using xlc -05:
########################################################################
#
Subdirectories
########################################################################
DIRS
:= spu
########################################################################
#
Target
########################################################################
PROGRAM_ppu :=
euler_spe
########################################################################
#
Local Defines
########################################################################
PPU_COMPILER = xlc
IMPORTS
= spu/lib_particle_spu.a -lspe2 -lpthread
CC_OPT_LEVEL := -O5
CFLAGS
= -g
LDFLAGS
= -g -Wl,-q
########################################################################
#
buildutils/make.footer
########################################################################
ifdef CELL_TOP
include
$(CELL_TOP)/buildutils/make.footer
else
include
../../../../buildutils/make.footer
endif
Example: xlc
-05 sampling data:
Applying xlC –O5 reduces it manages to slightly reduce the total number cycles to 1324 cycles with no outstanding bottlenecks at the instruction level:

Collecting
frequency profile using FDPR-Pro:
Another important tool that can help with the porting and the tuning process is the FDPR-Pro which is able to collect accurate frequency profile by instrumenting the binary file and then displaying the collected frequency along with appropriate performance comments, on the Code Analyzer plugin of the VPA tool.
In order to instrument the Euler single SPE example and collect thefreequency profile, use the FDPR-Pro tool directly on the euler_spe executable.
The FDPR-Pro RPM file for Cell consists of 3 main files:
-
/usr/bin/fdprpro
-
/usr/lib/libfdprinst32.so
- /usr/lib64/libfdprinst64.so
Important: remove all previously generated SPU profile files (suffixed by .mprof):
$ rm *.mprof
Recommended: create a temporary directory for FDPR-Pro tmp files generated during its analysis phase:
$ mkdir ./tmpdir
Run the following command to instrument the euler_spe exec by FDPR-Pro and to create an instrumented file called fft.instr along with an empty profile file euler.nprof:
$ fdprpro euler_spe -cell -spedir tmpdir -a instr
This
will generate a new file named euler_spe.instr
Running this will automatically generate two files containing frequency profiling information:
- euler_spe.nprof // profiling frequency information for the PPU code part
- particle.mprof // profiling frequency information for each of the SPU code part
The binary executable file euler_spe can now be loaded and analyzed by Code Analyzer plugin in VPA, along with the generated profile files.
Example:
Frequency profiling data:
Here
is what Code Analyzer displays for the collected frequency profiling of the
Euler SPE program: 
Adding branch hint
directives:
From the above view, here is an important performance comment which Code Analyzer displays about a missing hint instruction from the SPE code:

Branch hint bits in the Cell SPE are the equivalent of the branch prediction bits that are used on the PPE. However, with the lack of profiling information, the compiler cannot always determine if a hint bit is needed.
Hint bits on the SPE processor usually have more impact on the performance with compare to the PPE processor, as there is no hardware in the SPE to support branch prediction.
To explicitly specify a branch hint in the Cell SPE,
one can use the following intrinsic:
__builtin_expect(
expression, expected_value )
which generates branching code based off of the first argument, buffer < buffer_end, but produces branch hints based off of the second argument, 1.
This intrinsic instructs the compiler to generate hints
that expect the value of buffer <
buffer_end to be 1.
For
example:
while(__builtin_expect(buffer
< buffer_end, 1)) {
*buffer =
operation_on_buffer(*buffer);
buffer++;
}
The modified SPE code for the Euler example can, therefore, include the following hints statements for all its loops:
for (time=0; __builtin_expect(time<END_OF_TIME,1); time += ctx.dt) {
…
for (i=0; __builtin_expect(i<ctx.particles,1); i+=PARTICLES_PER_BLOCK) {
…
for (j=0; __builtin_expect(j<cnt,1); j++) {
…
5. Euler example that uses multiple
SPEs
When coming to parallelize scalar code we always search first for the outer loop to be parallelized. This is in contrast to the SIMDization process in which we always try to vectorize the operations of the most inner loop.
In our example the nested loops running on a single SPE are easily parallelized such that each chunk runs on a different SPE, as follows.
Since we will use the same SPE code from the previous version, only the PPE code changes as follows:
Example: PPE code for multiple SPEs:
#define END_OF_TIME 10
#define PARTICLES
100000
typedef struct { /* 4-D vector definition.
*/
float x, y, z,
w;
}
vec4D;
#define MAX_SPE_THREADS
16
vec4D pos[PARTICLES] __attribute__
((aligned (16)));
vec4D vel[PARTICLES] __attribute__
((aligned (16)));
vec4D force __attribute__ ((aligned
(16)));
float inv_mass[PARTICLES] __attribute__
((aligned (16)));
float dt = 1.0f;
extern spe_program_handle_t
particle;
typedef struct ppu_pthread_data
{
spe_context_ptr_t
spe_ctx;
pthread_t
pthread;
void
*argp;
}
ppu_pthread_data_t;
void *ppu_pthread_function(void *arg)
{
ppu_pthread_data_t *datap =
(ppu_pthread_data_t *)arg;
unsigned int
entry = SPE_DEFAULT_ENTRY;
if
(spe_context_run(datap->spe_ctx, &entry, 0, datap->argp, NULL, NULL)
< 0) {
perror ("Failed running context");
exit (1);
}
pthread_exit(NULL);
}
int main()
{
int i, offset,
count, spe_threads;
ppu_pthread_data_t
datas[MAX_SPE_THREADS];
parm_context
ctxs[MAX_SPE_THREADS] __attribute__ ((aligned (16)));
/* Determine
the number of SPE threads to create.
*/
spe_threads = spe_cpu_info_get(SPE_COUNT_USABLE_SPES, -1);
if (spe_threads
> MAX_SPE_THREADS) spe_threads = MAX_SPE_THREADS;
/* Create
multiple SPE threads */
for (i=0,
offset=0; i<spe_threads; i++, offset+=count) {
/* Construct a parameter
context for each SPE. Make sure
* that each SPEs (excluding the last) particle count is a
multiple
* of 4 so that inv_mass context pointer is always quadword
aligned.
*/
count = (PARTICLES / spe_threads + 3) &
~3;
ctxs[i].particles = (i==(spe_threads-1)) ? PARTICLES - offset : count;
ctxs[i].pos_v = (vector float
*)&pos[offset];
ctxs[i].vel_v = (vector float
*)&vel[offset];
ctxs[i].force_v = *((vector float
*)&force);
ctxs[i].inv_mass =
&inv_mass[offset];
ctxs[i].dt = dt;
/* Create SPE context */
if
((datas[i].spe_ctx = spe_context_create (0, NULL)) == NULL)
{
perror ("Failed creating
context");
exit (1);
}
/* Load SPE program into the
SPE context */
if
(spe_program_load (datas[i].spe_ctx, &particle)) {
perror ("Failed loading program");
exit (1);
}
/* Initialize context run
data */
datas[i].argp = &ctxs[i];
/* Create pthread for each of the SPE conexts
*/
if
(pthread_create (&datas[i].pthread, NULL, &ppu_pthread_function,
&datas[i])) {
perror ("Failed creating thread");
}
}
/* Wait for all the SPE threads to
complete.*/
for (i=0;
i<spe_threads; i++) {
if
(pthread_join (datas[i].pthread, NULL)) {
perror ("Failed joining thread");
exit (1);
}
/* Destroy context */
if
(spe_context_destroy (datas[i].spe_ctx) != 0) {
perror("Failed destroying context");
exit (1);
}
}
return
(0);
}
Compiling the multiple SPEs euler example:
To compile the above
Euler example for the multiple SPEs case, use the same recommended Makefiles
that were used for the previous Euler example for the single SPE case, after
modifying the file name of the PPE part in the PROGRAM_ppu variable,
accordingly:
Example: PPE makefile for multiple
SPEs:
########################################################################
#
Subdirectories
########################################################################
DIRS
:= spu
########################################################################
#
Target
########################################################################
PROGRAM_ppu :=
euler_multi_spe
########################################################################
#
Local Defines
########################################################################
PPU_COMPILER = xlc
IMPORTS
= spu/lib_particle_spu.a -lspe2 -lpthread
CC_OPT_LEVEL := -O5
CFLAGS
= -g
LDFLAGS
= -g -Wl,-q
########################################################################
#
buildutils/make.footer
########################################################################
ifdef CELL_TOP
include
$(CELL_TOP)/buildutils/make.footer
else
include
../../../../buildutils/make.footer
endif
6. Tuning the Euler example that uses multiple
SPEs
Since we already tuned the SPE code part separately, our focus now is to make sure that the program is properly scaled and free of synchronization effects that can easily eat away any potential performance gain from the parallelized program.
A useful way to visualize a running program on Cell with multiple SPEs, is to collect traces by the PDT tool and then display them using the Trace Analyzer plugin of VPA.
Compiling the multiple SPEs euler example to collect trace
information:
In order to collect PDT traces there is a need to modify the Makefile for the SPE code as follows, whereas the PPE Makefile is left unchanged:
Example: SPE makefile to collect trace
information:
########################################################################
#
Target
########################################################################
PROGRAM_spu
:= particle
LIBRARY_embed
:= lib_particle_spu.a
########################################################################
#
Local Defines
########################################################################
SPU_COMPILER := xlc
INCLUDE
:= -I ..
CC_OPT_LEVEL := -O5
########################################################################
#
Target
########################################################################
PROGRAM_spu :=
particle
LIBRARY_embed
:= lib_particle_spu.a
########################################################################
#
Local Defines
########################################################################
SPU_COMPILER := xlc
INCLUDE
:= -I ..
CC_OPT_LEVEL := -O5
CPPFLAGS_xlc := -I/usr/spu/include/trace
-Dmain=_pdt_main -Dexit=_pdt_exit -DMFCIO_TRACE -DLIBSYNC_TRACE
LDFLAGS_xlc :=
-L/usr/spu/lib/trace -ltrace
CFLAGS
= -g
LDFLAGS
= -g -Wl,-q
########################################################################
#
buildutils/make.footer
########################################################################
ifdef CELL_TOP
include
$(CELL_TOP)/buildutils/make.footer
else
include
../../../../../buildutils/make.footer
endif
Collecting
trace information for the multiple SPEs:
Before running the
PDT compiled Euler program, the following environment variables need to be
exported:
$ export
LD_LIBRARY_PATH=/usr/lib/trace
$ export
PDT_KERNEL_MODULE=/usr/lib/modules/pdt.ko
$ export
PDT_CONFIG_FILE=/usr/share/pdt/config/pdt_cbe_configuration.xml
Analyzing trace information:
It is interesting
to look at the traced SPE events collected for STEP3_multi_spe version
using the Trace Analyzer plugin of VPA.
The following
screen shots show the trace as collected for the PPE and each SPE, for the Euler
program. The green color on the SPEs rows represents regular computation with no
specific calls to SPE API macros. The blue color represents MFCIO events such as
SPE_MFC_GET, SPE_MFC_GETB and SPE_MFC_PUT. No color (white
color) represents non-computation mode, while all other colors represent
synchronization events resulting from the SPE API macros.
Viewing trace data for the PPE and each
SPE:
SPEs 0 thru
7:

SPEs 7 thru 15:

When zooming into the blue areas, as shown in the following screen shot, we immediately see that the dominant delaying events are the SPE_MFC_READ_TAG_STATUS events which take up most of the beginning of each of the SPE execution time. This event means that the threads are waiting for the DMA read to complete. One of the reasons appear to be the synchronization created between the SPE_MFC_GET and the SPE_MFC_PUT events on each SPE (shown in purple and yellow colors) which cause a delay in the DMA read completion for the proceeding SPEs.

7. The Tuned Euler example using multiple SPE units
In order to resolve the synchronization bottleneck shown above and make the program scalable for the multiple running SPEs, it is clear that we need to focus on parallelizing the input and output DMA transfers, and as a result shorten the delays caused by the DMA read transfers.
One way to resolve this is to interleave DMA transfers with computation by multibuffering the inputs and outputs to eliminate (or reduce) DMA stalls. These stalls are not reflected in the static and dynamic analyses. In the process of adding double buffering, the inner loop is moved into a function, so that the code need not be repeated.
Following is the modified code for the multiple SPE Euler example.
The PPE code part for the tuned multi SPE Euler example remains unchanged from the previous step where multiple SPE code context are applied for all available SPEs.
Example: SPE
code for the tuned multiple SPEs:
#define END_OF_TIME 10
#define PARTICLES
100000
/*
4-D vector definition.
*/
typedef struct {
float x, y, z,
w;
}
vec4D;
#define PARTICLES_PER_BLOCK 1024 /* # of particles to process
at a time *
// Local store structures and
buffers.
volatile parm_context ctx;
volatile vector float pos[2][PARTICLES_PER_BLOCK];
volatile vector float vel[2][PARTICLES_PER_BLOCK];
volatile vector float
inv_mass[2][PARTICLES_PER_BLOCK/4];
void process_buffer(int buffer, int cnt, vector float
dt_v)
{
int j;
vector float dt_inv_mass_v;
for (j=0;
j<cnt; j++) {
pos[buffer][j] = spu_madd(vel[buffer][j], dt_v,
pos[buffer][j]);
dt_inv_mass_v = spu_mul(dt_v, spu_splats(inv_mass[buffer][j]));
vel[buffer][j] = spu_madd(dt_inv_mass_v, ctx.force_v,
vel[buffer][j]);
}
}
int main(unsigned long long spu_id __attribute__ ((unused)),
unsigned long long argv)
{
int buffer, next_buffer;
int cnt, next_cnt, left;
float time, dt;
vector float dt_v;
volatile vector float *ctx_pos_v,
*ctx_vel_v;
volatile vector float *next_ctx_pos_v,
*next_ctx_vel_v;
volatile float *ctx_inv_mass,
*next_ctx_inv_mass;
unsigned int tags[2];
// Reserve
a pair of DMA tag IDs
tags[0] = mfc_tag_reserve();
tags[1] = mfc_tag_reserve();
// Input
parameter argv is a pointer to the particle context.
// Fetch
the parameter context, waiting for it to complete.
mfc_write_tag_mask(1 <<
tags[0]);
mfc_get((void *)(&ctx), (unsigned int)argv,
sizeof(parm_context), tags[0], 0, 0);
mfc_read_tag_status_all();
dt = ctx.dt;
dt_v =
spu_splats(dt);
// For each step in time
for (time=0; time<END_OF_TIME; time += dt)
{
// For each double buffered block of
particles
left = ctx.particles;
cnt = (left < PARTICLES_PER_BLOCK) ? left : PARTICLES_PER_BLOCK;
ctx_pos_v = ctx.pos_v;
ctx_vel_v = ctx.vel_v;
ctx_inv_mass = ctx.inv_mass;
// Prefetch first buffer of input data.
buffer = 0;
mfc_getb((void *)(pos), (unsigned
int)(ctx_pos_v), cnt * sizeof(vector float), tags[0],
0,0);
mfc_get((void *)(vel), (unsigned
int)(ctx_vel_v), cnt * sizeof(vector float), tags[0],
0,0);
mfc_get((void *)(inv_mass), (unsigned
int)(ctx_inv_mass), cnt * sizeof(float), tags[0], 0,0);
while (cnt < left) {
left -= cnt;
next_ctx_pos_v =
ctx_pos_v + cnt;
next_ctx_vel_v =
ctx_vel_v + cnt;
next_ctx_inv_mass = ctx_inv_mass + cnt;
next_cnt = (left
< PARTICLES_PER_BLOCK) ? left
: PARTICLES_PER_BLOCK;
// Prefetch next
buffer so the data is available for computation on next loop
iteration.
// The first DMA is barriered so that we don't GET data before
the previous iteration's
// data is
PUT.
next_buffer =
buffer^1;
mfc_getb((void *)(&pos[next_buffer][0]), (unsigned
int)(next_ctx_pos_v), next_cnt * sizeof(vector float), tags[next_buffer],
0,0);
mfc_get((void *)(&vel[next_buffer][0]), (unsigned
int)(next_ctx_vel_v), next_cnt * sizeof(vector float), tags[next_buffer],
0,0);
mfc_get((void *)(&inv_mass[next_buffer][0]), (unsigned
int)(next_ctx_inv_mass), next_cnt * sizeof(float), tags[next_buffer],
0,0);
// Wait for
previously prefetched data
mfc_write_tag_mask(1 <<
tags[buffer]);
mfc_read_tag_status_all();
process_buffer(buffer, cnt, dt_v);
// Put the
buffer's position and velocity data back into system
memory
mfc_put((void *)(&pos[buffer][0]), (unsigned int)(ctx_pos_v),
cnt * sizeof(vector float), tags[buffer], 0, 0);
mfc_put((void *)(&vel[buffer][0]), (unsigned int)(ctx_vel_v),
cnt * sizeof(vector float), tags[buffer], 0, 0);
ctx_pos_v =
next_ctx_pos_v;
ctx_vel_v =
next_ctx_vel_v;
ctx_inv_mass =
next_ctx_inv_mass;
buffer = next_buffer;
cnt = next_cnt;
}
// Wait for previously prefetched data
mfc_write_tag_mask(1 <<
tags[buffer]);
mfc_read_tag_status_all();
process_buffer(buffer, cnt,
dt_v);
// Put the buffer's position and velocity data back into system
memory
mfc_put((void *)(&pos[buffer][0]), (unsigned
int)(ctx_pos_v), cnt * sizeof(vector float), tags[buffer], 0,
0);
mfc_put((void *)(&vel[buffer][0]), (unsigned
int)(ctx_vel_v), cnt * sizeof(vector float), tags[buffer], 0,
0);
// Wait for DMAs to complete before starting the next step in
time.
mfc_write_tag_mask(1 <<
tags[buffer]);
mfc_read_tag_status_all();
}
return (0);
}
Viewing
trace data for the tuned multiple SPE:
The following collected PDT traces after compiling the code with xlC –O5 level, show a clear improvement from more than ~5 million cycles to less than ~2.5 million, as shown below:
PPE and SPEs 0
thru 7:

SPEs 7 thru
15:

However, in order to further improve the performance we can replace the process_buffer function by an improved version which includes a loop unrolling by:
·
Unrolling
the loop to provide additional instructions for
interleaving.
·
Loading
the DMA-buffer contents into local nonvolatile registers to eliminate volatile
migration constraints..
·
Eliminating
the scalar load of the inv_mass
variable.
·
Eliminating
extra multiplies of dt*inv_mass
and
splat
the
products after the SIMD multiply, instead of before the multiply.
Example: Further tuned
SPE code for the process_buffer function:
void process_buffer(int buffer, int cnt, vector float
dt_v)
{
int i;
volatile vector float *p_inv_mass_v;
vector float force_v, inv_mass_v;
vector float pos0, pos1, pos2, pos3;
vector float vel0, vel1, vel2, vel3;
vector float dt_inv_mass_v, dt_inv_mass_v_0, dt_inv_mass_v_1,
dt_inv_mass_v_2, dt_inv_mass_v_3;
vector unsigned char splat_word_0 = (vector unsigned char){0,
1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3};
vector unsigned char splat_word_1 = (vector unsigned char){4,
5, 6, 7, 4, 5, 6, 7, 4, 5, 6, 7, 4, 5, 6, 7};
vector
unsigned char splat_word_2 = (vector unsigned char){8, 9,10,11, 8, 9,10,11, 8,
9,10,11, 8, 9,10,11};
vector
unsigned char splat_word_3 = (vector unsigned
char){12,13,14,15,12,13,14,15,12,13,14,15,12,13,14,15};
p_inv_mass_v = (volatile vector float *)&inv_mass[buffer][0];
force_v =
ctx.force_v;
// Compute
the step in time for the block of particles, four
// particle
at a time.
for (i=0; i<cnt; i+=4) {
inv_mass_v = *p_inv_mass_v++;
pos0 = pos[buffer][i+0];
pos1 = pos[buffer][i+1];
pos2 = pos[buffer][i+2];
pos3 = pos[buffer][i+3];
vel0 = vel[buffer][i+0];
vel1 = vel[buffer][i+1];
vel2 = vel[buffer][i+2];
vel3 = vel[buffer][i+3];
dt_inv_mass_v = spu_mul(dt_v,
inv_mass_v);
pos0 = spu_madd(vel0, dt_v,
pos0);
pos1 = spu_madd(vel1, dt_v,
pos1);
pos2 = spu_madd(vel2, dt_v,
pos2);
pos3 = spu_madd(vel3, dt_v,
pos3);
dt_inv_mass_v_0 = spu_shuffle(dt_inv_mass_v,
dt_inv_mass_v, splat_word_0);
dt_inv_mass_v_1 = spu_shuffle(dt_inv_mass_v,
dt_inv_mass_v, splat_word_1);
dt_inv_mass_v_2 = spu_shuffle(dt_inv_mass_v,
dt_inv_mass_v, splat_word_2);
dt_inv_mass_v_3 = spu_shuffle(dt_inv_mass_v,
dt_inv_mass_v, splat_word_3);
vel0 = spu_madd(dt_inv_mass_v_0, force_v,
vel0);
vel1 = spu_madd(dt_inv_mass_v_1, force_v,
vel1);
vel2 = spu_madd(dt_inv_mass_v_2, force_v,
vel2);
vel3 = spu_madd(dt_inv_mass_v_3, force_v,
vel3);
pos[buffer][i+0] = pos0;
pos[buffer][i+1] = pos1;
pos[buffer][i+2] = pos2;
pos[buffer][i+3] = pos3;
vel[buffer][i+0] = vel0;
vel[buffer][i+1] = vel1;
vel[buffer][i+2] = vel2;
vel[buffer][i+3] = vel3;
}
}
Viewing trace data for the PPE and each
SPE of the final tuned Euler example:
Resulted traces show another significant improvement in performance and a total of 1.6 Miliion cycles
PPE and SPEs 0 thru 7:

SPEs 7 thru 15:

References:
1. introductory overview of the Cell multiprocessor IBM JRD 49-4/5 | Introduction to the Cell multiprocessor
2. IBM Systems Journal issue 45-1, Online Game Technology IBM SJ 45-1 | Using advanced compiler technology
3. Introduction to the Cell Broadband Engine Cell Broadband Engine - IBM Microelectronics
4. Cell Broadband Engine™ processor – based systems White Paper
5. Spufs: The Cell Synergistic Processing Unit as a virtual file system
6. ”Maximizing the power of the Cell Broadband Engine processor: 25 tips to optimal application performance", by Dan Brokenshire
7.
FFTC : Fastest
Fourier Transform for the IBM Cell Broadband Engine David A. Bader and Virat
Agarwal College of Computing Georgia Institute of Technology,
Atlanta,
GA, USA 30332
8. The Potential of the Cell Processor for Scientific Computing, by Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, Katherine Yelick, Computational Research Division,Lawrence Berkeley National Laboratory, Berkeley, CA 94720
9.
FFTC : Fastest
Fourier Transform for the IBM Cell Broadband Engine
David A. Bader and
Virat Agarwal College of Computing Georgia Institute of Technology,
Atlanta, GA, USA 30332
10. White Paper - A Programming Example: Large Fast Fourier Transform ...
11. IBM BladeCenter QS20 Performance Potentials
12. ”Maximizing the power of the Cell Broadband Engine processor: 25 tips to optimal application performance", by Dan Brokenshire
13.
Cell BE programming Tutorial http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/FC857AE550F7EB83872571A80061F788?Open&S_TACT=105AGX16&S_CMP=LP
14.
Thomas Chen, et al., “Cell
Broadband Engine Architecture and its First Implementation – a Performance
View”, http://www.ibm.com/developerworks/power/library/pa-cellperf/
15.
Samuel
Williams, et al., “The Potential of the Cell Processor for Scientific
Computing”,
http://www.cs.berkeley.edu/~samw/projects/cell/CF06.pdf#search=%22Cell%20performance%20%20FFT%22
16.
Carsten Benthin, et al., “Ray Tracing on the Cell
Processor”, http://graphics.cs.unisb.de/~benthin/cellrt06.pdf#search=%22InTrace%20Ray%20Tracing%20Performance%20Opteron%22
bltiQS20perfwp103106.doc Page 5
17.
Alex
Chunghen Chow, et al., “A Programming Example: Large FFT on the Cell Broadband
Engine”,
http://www.ibm.com/chips/techlib/techlib.nsf/techdocs/0AA2394A505EF0FB872570AB005BF0F1/$file/GSPx_FFT_paper_legal_0115.pdf#search=%22Cell%20performance%20%20FFT%22
18.
B. Minor, G. Fossum, V. To,
“Terrain Rendering Engine (TRE): Cell Broadband Optimized Real-Time Ray-Caster”,
Proceedings of GPSx conference, October 2005.