Tuesday, March 19, 2013

Using ARM Neon Intrinsics for Image Processing

SA,
I have been working on the ARM platform lately, and I tried using ARM NEON intrinsics to optimize image processing functions that are used in OpenCV. Mastering SIMD is also useful for other applications, such as embedded software with critical timing requirements and real-time rendering in games.
ARM intrinsics and SIMD are a good place to start optimizing. The idea is to load and process several data elements with a single instruction.

Most mobile phones today use ARM-based processors, so knowing the ARM architecture is essential for writing games, augmented reality, or any other real-time application on iPhone or Android, for example.

A simple example I worked on is subtracting two images at a large resolution, for example 1024*1024 with 3 channels. This operation takes a lot of cycles with normal non-SIMD instructions. The normal way is to iterate over width*height and subtract the RGB pixels of the source and the dest one at a time. NEON comes to the rescue by loading multiple pixels with a single instruction, which saves a lot of CPU cycles and hence gives better performance. A simple code is like the following:
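        // load 8 RGB pixels (24 bytes) from each image, split into R, G and B lanes
        uint8x8x3_t srcPixels = vld3_u8(src);
        uint8x8x3_t dstPixels = vld3_u8(dst);
        // subtract them channel by channel (vsub_u8 wraps around on underflow)
        uint8x8x3_t subPixels;
        subPixels.val[0] = vsub_u8(srcPixels.val[0], dstPixels.val[0]);
        subPixels.val[1] = vsub_u8(srcPixels.val[1], dstPixels.val[1]);
        subPixels.val[2] = vsub_u8(srcPixels.val[2], dstPixels.val[2]);
        // store the 8 result pixels, interleaved back to RGB
        vst3_u8(result, subPixels);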
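
The snippet above handles eight pixels only. A minimal sketch of how it could be extended to a whole image is shown below. The function name subtract_images and its parameters are assumptions for illustration, not taken from OpenCV or from my original test code. Since subtraction works byte by byte, this version does not split the channels and simply processes 16 bytes per iteration with the 128-bit intrinsics; a scalar loop handles the leftover bytes and also lets the code build on machines without NEON.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: result[i] = src[i] - dst[i] for n bytes,
 * where n = width * height * channels. Differences wrap modulo 256,
 * like vsub_u8; vqsubq_u8 could be used instead to clamp at 0. */
static void subtract_images(const uint8_t *src, const uint8_t *dst,
                            uint8_t *result, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Subtraction is per byte, so the channels need not be de-interleaved:
     * vld1q_u8 loads 16 consecutive bytes into one 128-bit register. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t a = vld1q_u8(src + i);
        uint8x16_t b = vld1q_u8(dst + i);
        vst1q_u8(result + i, vsubq_u8(a, b));
    }
#endif
    /* Scalar tail (and the whole loop on builds without NEON). */
    for (; i < n; ++i)
        result[i] = (uint8_t)(src[i] - dst[i]);
}
```

For a 1024*1024 image with 3 channels, n would be 1024 * 1024 * 3.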

Monday, March 11, 2013

Using ARM Neon to optimize square root function

SA,

I was working on a platform with an ARM Cortex-A9 CPU (Samsung Exynos 4412) and had some computer vision code that does a few sqrtf calculations, and it was really slow :-).

Calculating the square root is helpful in many cases, especially in game development, for example to calculate the distance between points. Everyone who has worked on game development, especially in the era of software rendering but even on modern CPUs, knows that sqrt from cmath is really heavy on the processor. The same goes for calculating sin and cos. One of the tricks is to use a lookup table that holds precomputed values for most angles, and then retrieve the result from the table when you need it.

John Carmack, the lead programmer of most id Software games (Quake, Doom), is known for a very fast inverse square root function that uses a Newton-Raphson approximation; the function can be found in the Quake 3 source code. I tried that function and it was fast, but not as fast as using the ARM NEON intrinsics.
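
For reference, the Quake 3 trick can be sketched as follows. This is a rewrite in the style of that function, not a copy of it: the names fast_rsqrt and fast_sqrt are made up here, and memcpy replaces the original pointer cast so that the bit reinterpretation is well defined in C. It assumes a 32-bit IEEE 754 float.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) for x > 0, in the style of Quake 3's Q_rsqrt. */
static float fast_rsqrt(float x)
{
    float half = 0.5f * x;
    uint32_t bits;
    float y;

    memcpy(&bits, &x, sizeof bits);      /* read the float's bit pattern */
    bits = 0x5f3759df - (bits >> 1);     /* magic-constant initial guess */
    memcpy(&y, &bits, sizeof y);
    y = y * (1.5f - half * y * y);       /* one Newton-Raphson step */
    return y;
}

/* sqrt(x) = x * (1/sqrt(x)); zero and negative inputs return 0 here. */
static float fast_sqrt(float x)
{
    return x > 0.0f ? x * fast_rsqrt(x) : 0.0f;
}
```

With a single Newton-Raphson step the result is only an approximation, which is usually fine for distances in a game but not for code that needs full precision.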

I wrote a small test application with one version that uses sqrt from cmath and one that uses ARM NEON, and the result was that the NEON intrinsics version was two times faster than the normal square root.



I also used OpenMP's omp_get_wtime to measure the time taken by each of the two functions.
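
A minimal sketch of such a timing harness is shown below. The helper names now_seconds, time_function and sample_work are assumptions for illustration, and sample_work is only a stand-in for the function being measured. When the code is not compiled with OpenMP enabled, it falls back to clock(), which measures CPU time rather than wall-clock time.

```c
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Current time in seconds: omp_get_wtime() when OpenMP is enabled,
 * otherwise CPU time from clock(). */
static double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Hypothetical helper: average seconds per call of fn over `runs` calls. */
static double time_function(void (*fn)(void), int runs)
{
    double start = now_seconds();
    for (int r = 0; r < runs; ++r)
        fn();
    return (now_seconds() - start) / runs;
}

/* Stand-in workload; volatile keeps the compiler from removing the loop. */
static volatile float sink;
static void sample_work(void)
{
    float acc = 0.0f;
    for (int i = 1; i <= 100000; ++i)
        acc += 1.0f / (float)i;
    sink = acc;
}
```

Calling time_function(sample_work, 100) once for each implementation gives two averages that can be compared directly.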