© 2009 IPK Gatersleben

Cell Programming

For a successful application port to the Cell processor, the developer must first understand the Cell architecture; second, a number of special code optimisations are required. SPE code tuning in particular can yield considerable application speedups compared to conventional processors. Here we provide some basic examples of how to exploit the performance of the Cell Broadband Engine. In most cases, significant improvements can be achieved by following these rules:
  1. Schedule the work on all cores
  2. Use vector operations
  3. Eliminate or reduce branches
  4. Avoid 32-bit integer multiplications
  5. Unroll loops
By applying the steps above, benchmark results close to the processor's peak performance are possible.

The following optimisation examples are explained in C with SPE-specific intrinsics and SIMD extensions provided by IBM's Software Development Kit (SDK) for Multicore Acceleration.


Work partitioning

Because the Cell is a multi-core processor, an efficient partitioning of the computations across all processing units is required. It is recommended that the SPEs perform the computationally intensive work while the PPE organises the task flow for them. A necessary condition is that the overall calculation can be decomposed. Typically the main application is launched on the PPE, which distributes jobs and data to the SPEs. The following code fragment demonstrates a CBE-specific partitioning strategy.

// SPE thread identifiers
speid_t spe_id[NUM_SPES];
// per-SPE context data
context ctx[NUM_SPES];
int status, i;

// create SPE threads and load each context into SPE local storage
for (i = 0; i < NUM_SPES; i++) {
	spe_id[i] = spe_create_thread(0, &spe_procedure, &ctx[i], NULL, -1, 0);
}
// wait for all SPE threads to complete
for (i = 0; i < NUM_SPES; i++) {
	spe_wait(spe_id[i], &status, 0);
}


Vectorisation

The eight SPEs of the Cell are optimised for Single-Instruction-Multiple-Data (SIMD) operations. Code SIMDisation goes hand-in-hand with the vectorisation of data and instructions. This section shows a typical example of how the programmer can convert code based on scalar data types and operations into code based on vector data types and SIMD operations. Even though compilers (such as GCC) can automatically aggregate scalar data into parallel SIMD data structures, these capabilities remain limited, so it is recommended to do this task manually. The following code fragment implements such an optimisation step using the IBM SDK language extensions for C/C++.

// rotate an image[xMax][yMax] by alpha degrees
for (y = 0; y < yMax; y++) {
	for (x = 0; x < xMax; x++) {
		newX = cos(alpha) * oldX + sin(alpha) * oldY + centerX;
		newY = -sin(alpha) * oldX + cos(alpha) * oldY + centerY;
		. . .
		rotImage[x][y] = image[newX][newY];
	}
}
// same procedure vectorised for the Cell
// define vector variables
vector signed int vec_oldX, vec_oldY, vec_newX, vec_newY;
vector float vec_sinAlpha, vec_cosAlpha;
vector signed int vec_centerX, vec_centerY;

for (y = 0; y < yMax; y++) {
	for (x = 0; x < xMax; x += 4) {
		vec_newX = vec_cosAlpha * vec_oldX
			 + vec_sinAlpha * vec_oldY + vec_centerX;
		vec_newY = vec_cosAlpha * vec_oldY
			 - vec_sinAlpha * vec_oldX + vec_centerY;
		. . .
		// extract the new coordinates from the vectors
		rotImage[x][y]   = image[spu_extract(vec_newX, 0)]
		                        [spu_extract(vec_newY, 0)];
		rotImage[x+1][y] = image[spu_extract(vec_newX, 1)]
		                        [spu_extract(vec_newY, 1)];
		rotImage[x+2][y] = image[spu_extract(vec_newX, 2)]
		                        [spu_extract(vec_newY, 2)];
		rotImage[x+3][y] = image[spu_extract(vec_newX, 3)]
		                        [spu_extract(vec_newY, 3)];
	}
}

Branch reduction

The SPE is not optimised for branching structures in application code. A mispredicted branch incurs a penalty of up to 19 cycles, which can dramatically reduce program performance. To avoid this, the programmer can make the source code branchless by computing all possible results and then selecting the correct one. The listing below shows such a branch elimination for a typical if-else condition.

// a typical branch
if (newX < 0 || newX > xMax || newY < 0 || newY > yMax) {
	rotImage[x][y] = background;
}

// condition made branchless with IBM SDK intrinsics
vec_select = spu_cmpgt(vec_zero, vec_newX);
vec_select = spu_or(spu_cmpgt(vec_newX, vec_xMax), vec_select);
vec_select = spu_or(spu_cmpgt(vec_zero, vec_newY), vec_select);
vec_select = spu_or(spu_cmpgt(vec_newY, vec_yMax), vec_select);
vec_rotImage = spu_sel(vec_rotImage, vec_background, vec_select);

Admittedly, this optimisation produces more lines of code and more individual calculations, but it gains a significant speedup on an SPE.

Loop unrolling

One of the most common methods for SIMD programming is loop unrolling. Even though modern compilers can automatically schedule operations for dual issue, it is often useful to implement several levels of unrolling, especially for long or nested loops. This technique can make optimal use of the SPE's large register file of 128 registers, each 128 bits wide, which can result in better application performance. The following code fragment illustrates a fourfold unrolling of a nested loop.

// translate an image
// vectorised but no loop unrolling
for (y = 0; y < yMax; y++) {
	for (x = 0; x < xMax; x += 4) {
		vec_newX = . . .
		vec_newY = . . .
		// translate each component of the vector
		transImage[x][y] = . . .
		. . .
		transImage[x+3][y] = . . .
	}
}

// same translate function
// vectorised and fourfold unrolled
for (y = 0; y < yMax; y++) {
	for (x = 0; x < xMax; x += 16) {
		vec_newX0 = . . .
		vec_newX1 = . . .
		vec_newX2 = . . .
		vec_newX3 = . . .
		. . .
		// translate 4x4 components
		transImage[x][y] = . . .
		. . .
		transImage[x+15][y] = . . .
	}
}

This site gives only an overview of some basic optimisation rules for the Cell Broadband Engine. For detailed information, explanations and further examples, the references below may be useful.

Further Readings