NEON Assembly Programming on the iPad/iPhone 3GS

A starting guide to optimization using the NEON unit on various iDevices.

Posted by on - Intermediate Client Side Coding

To give players endless gameplay in Trudy's Mechanicals, we chose to generate maps procedurally. This means we cannot pre-bake any of the lighting offline. To get the best possible graphics in Trudy, we generate our lighting on the fly, and it has to be as fast as possible. Luckily, these kinds of calculations are easily parallelised using the NEON unit in the iPhone 3GS/4 and the iPad!

The Architecture

For the sake of this article I'm going to assume that you have some grasp of the ARM architecture and assembly programming in general. NEON builds on the VFP floating-point unit found in older iDevices and features sixteen 128-bit quad-word SIMD registers.

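Each 128-bit register can be thought of as two 64-bit registers, and each of the first eight can also be viewed as four 32-bit registers. The registers are named as follows: the quad-word registers are q0 to q15, the double-word registers are d0 to d31, and the single-word registers are s0 to s31, with qN overlapping d(2N) and d(2N+1).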

NEON instructions operate on "qX" and "dX" registers. VFP instructions operate on the first 16 "dX" and the first 32 "sX" registers. For more information see the official ARM docs.
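
The aliasing between the register names can be easier to picture with a small model. The following is a minimal sketch in portable C, not real NEON code: the `regfile_model` union and its helpers are hypothetical names invented for illustration, and the layout assumes the usual mapping where qN overlays d(2N) and d(2N+1), and s0 to s31 overlay d0 to d15.

```c
#include <assert.h>

/* Hypothetical model of the shared NEON/VFP register file.
 * All three views cover the same storage, as the real registers do. */
typedef union {
    float q[16][4];   /* q0-q15: four 32-bit lanes each          */
    float d[32][2];   /* d0-d31: two 32-bit lanes each           */
    float s[32];      /* s0-s31: overlay d0-d15 (q0-q7) only     */
} regfile_model;

/* Index of the d register that aliases the low half of qN. */
static int q_low_d(int n)
{
    assert(n >= 0 && n < 16);
    return 2 * n;
}

/* Index of the d register that aliases the high half of qN. */
static int q_high_d(int n)
{
    assert(n >= 0 && n < 16);
    return 2 * n + 1;
}

/* Write four lanes into qN of the model. */
static void set_q(regfile_model *rf, int n,
                  float l0, float l1, float l2, float l3)
{
    assert(n >= 0 && n < 16);
    rf->q[n][0] = l0;
    rf->q[n][1] = l1;
    rf->q[n][2] = l2;
    rf->q[n][3] = l3;
}
```

In this model, writing 1, 2, 3, 4 into q1 with `set_q` makes d2 read back as {1, 2}, d3 as {3, 4}, and s4 to s7 as the same four values, which is why the matrix load later in this article can name d0 to d7 and still fill q0 to q3.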

Configuring your project in Xcode

Any assembly you write will only work on your iDevice, because Apple chose to provide a simulator rather than an emulator for debugging code on your Mac: simulator builds are compiled for the Mac's own processor, not for ARM. Including "TargetConditionals.h" brings in several defines that let us figure out what sort of compilation is occurring. The most important is TARGET_IPHONE_SIMULATOR, which is set when compiling for the simulator.

c code:
#include "TargetConditionals.h"

#if TARGET_IPHONE_SIMULATOR
// put your non-optimized C/Objective-C code here
#else
// put your assembly here
#endif

Xcode itself is geared towards creating binaries for all iPhone/iPad devices. To compile for the iPhone 3GS/4 or iPad only, we must change some build settings:

1) Un-check "Compile for Thumb"

2) Set "Valid Architectures" to "armv7" (yes, this doesn't say Cortex-A8, but it's fine) and check "Build Active Architecture Only".

3) Under the Build drop-down, un-check armv6 if it is checked.
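
One way to sanity-check these settings from code is to look at the compiler's predefined macros. This is a minimal sketch, assuming a GCC or Clang toolchain that defines `__ARM_NEON__` (or `__ARM_NEON`) when NEON is enabled and `__thumb__` when compiling for Thumb; the function names are hypothetical helpers, not part of any Apple API.

```c
#include <assert.h>

/* Returns 1 if the compiler is targeting NEON, 0 otherwise. */
static int built_with_neon(void)
{
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if the code is being compiled in Thumb mode, 0 otherwise. */
static int built_as_thumb(void)
{
#if defined(__thumb__)
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 when the build looks like the one described in the steps
 * above: NEON available and Thumb switched off. */
static int build_matches_tutorial(void)
{
    return built_with_neon() && !built_as_thumb();
}

/* Hypothetical guard: aborts a debug device build whose settings do
 * not match. Do not call it from simulator builds, which never match. */
static void require_tutorial_build(void)
{
    assert(build_matches_tutorial() && "expected an armv7 NEON, non-Thumb build");
}
```

Calling `require_tutorial_build()` once at start-up in a device build would catch a project whose settings were changed back. A simulator build reports no NEON, which is expected.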

An Example

Most of the time the compiler will do a much better job of creating assembly code than you will. However, transposing a 4x4 matrix is a great use case for switching to assembly. Traditionally you'd be forced to interchange the elements of the matrix with several swaps. Using the NEON unit we can load the entire matrix into registers, use some trickery to swap the elements, and write them back.
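
To make the "trickery" concrete, here is a portable C model of one commonly used NEON sequence for a 4x4 transpose: two VTRN.32 instructions followed by two VSWP instructions. This is a sketch under assumptions. The listing in this article is cut off after the load, so this is not necessarily the sequence the author used, and the `*_model` functions are hypothetical stand-ins that only imitate what the instructions do to the lanes.

```c
#include <assert.h>

/* Models VTRN.32 qa, qb: treats the two registers as 2x2 blocks and
 * transposes each block, i.e. swaps a[1] with b[0] and a[3] with b[2]. */
static void vtrn32_model(float a[4], float b[4])
{
    float t;
    t = a[1]; a[1] = b[0]; b[0] = t;
    t = a[3]; a[3] = b[2]; b[2] = t;
}

/* Models VSWP between the high d half of qa and the low d half of qb,
 * e.g. "vswp d1, d4" for q0 and q2. */
static void vswp_high_low_model(float a[4], float b[4])
{
    int i;
    for (i = 0; i < 2; ++i) {
        float t = a[2 + i];
        a[2 + i] = b[i];
        b[i] = t;
    }
}

/* Transposes a 4x4 matrix in place, with each row standing in for one
 * q register (row 0 = q0, row 1 = q1, and so on). */
static void mat4_transpose_model(float mat[4][4])
{
    assert(mat != 0);
    vtrn32_model(mat[0], mat[1]);        /* vtrn.32 q0, q1 */
    vtrn32_model(mat[2], mat[3]);        /* vtrn.32 q2, q3 */
    vswp_high_low_model(mat[0], mat[2]); /* vswp d1, d4    */
    vswp_high_low_model(mat[1], mat[3]); /* vswp d3, d6    */
}
```

The model uses four register-wide operations, whereas the scalar version needs six element swaps with three assignments each. That difference is the appeal of doing the work in the NEON unit.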

c code:
#include "TargetConditionals.h"

void mat4_transpose(float mat[4][4])
{
#if TARGET_IPHONE_SIMULATOR
    // plain C fallback: swap each element with its mirror across the diagonal
    float tmp;

    tmp = mat[1][0];
    mat[1][0] = mat[0][1];
    mat[0][1] = tmp;

    tmp = mat[2][0];
    mat[2][0] = mat[0][2];
    mat[0][2] = tmp;

    tmp = mat[2][1];
    mat[2][1] = mat[1][2];
    mat[1][2] = tmp;

    tmp = mat[3][0];
    mat[3][0] = mat[0][3];
    mat[0][3] = tmp;

    tmp = mat[3][1];
    mat[3][1] = mat[1][3];
    mat[1][3] = tmp;

    tmp = mat[3][2];
    mat[3][2] = mat[2][3];
    mat[2][3] = tmp;
#else
    __asm__ volatile (
    // load the matrix into q0,q1,q2,q3
    "vldmia %0, {d0,d1,d2,d3,d4,d5,d6,d7} \n\t"
