SA,
I have been working on the ARM platform lately, using ARM NEON intrinsics to optimize image processing functions that are used in OpenCV. Mastering SIMD is also useful for other applications, such as embedded software with tight timing constraints and real-time rendering in games.
A good place to start optimizing is with ARM intrinsics and SIMD (single instruction, multiple data). The idea is to load and process several data elements with a single instruction.
Most mobile phones today use ARM-based processors, so knowing the ARM architecture is essential for writing games, augmented reality, or other real-time applications on iPhone or Android, for example.
A simple example I worked on is subtracting two large images, for example 1024*1024 pixels with 3 channels. This operation takes a lot of cycles with ordinary non-SIMD instructions: the usual approach is to iterate over width*height and subtract the corresponding RGB pixels of the source and dest images one at a time. NEON comes to the rescue by loading multiple pixels with a single instruction, which saves a lot of CPU cycles and gives better performance. A simple example looks like this:
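#include <arm_neon.h>
// load 8 RGB pixels from each image (de-interleaved into R, G and B lanes)
uint8x8x3_t srcPixels = vld3_u8(src);
uint8x8x3_t dstPixels = vld3_u8(dst);
// subtract them, one channel at a time
uint8x8x3_t subPixels;
subPixels.val[0] = vsub_u8(srcPixels.val[0], dstPixels.val[0]);
subPixels.val[1] = vsub_u8(srcPixels.val[1], dstPixels.val[1]);
subPixels.val[2] = vsub_u8(srcPixels.val[2], dstPixels.val[2]);
// store the 8 result pixels (re-interleaved as RGB)
vst3_u8(result, subPixels);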
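To show how the snippet fits into a loop over a whole image, here is a minimal sketch, not a tuned implementation. The function name `subtract_rgb` and its `pixelCount` parameter are my own assumptions for illustration, and so is the `HAVE_NEON` macro. It assumes interleaved 8-bit RGB data and falls back to the plain scalar loop when NEON is not available or when fewer than 8 pixels remain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: result = src - dst for interleaved 8-bit RGB data.
   pixelCount is width * height; each pixel occupies 3 bytes. */
void subtract_rgb(const uint8_t *src, const uint8_t *dst,
                  uint8_t *result, size_t pixelCount)
{
    assert(src != NULL && dst != NULL && result != NULL);
    size_t i = 0;

#if HAVE_NEON
    /* NEON path: 8 pixels (24 bytes) per iteration. */
    for (; i + 8 <= pixelCount; i += 8) {
        uint8x8x3_t a = vld3_u8(src + 3 * i);
        uint8x8x3_t b = vld3_u8(dst + 3 * i);
        uint8x8x3_t d;
        d.val[0] = vsub_u8(a.val[0], b.val[0]);
        d.val[1] = vsub_u8(a.val[1], b.val[1]);
        d.val[2] = vsub_u8(a.val[2], b.val[2]);
        vst3_u8(result + 3 * i, d);
    }
#endif

    /* Scalar path: leftover pixels, or everything when NEON is unavailable. */
    for (; i < pixelCount; ++i) {
        for (int c = 0; c < 3; ++c) {
            result[3 * i + c] = (uint8_t)(src[3 * i + c] - dst[3 * i + c]);
        }
    }
}
```

For a 1024*1024 image, `pixelCount` would be 1024 * 1024. Note that `vsub_u8` wraps around modulo 256 when the result would be negative, and the scalar loop is written to match that; if clamping at zero is what you want, `vqsub_u8` is the saturating variant.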