Tuesday, August 26, 2014

Progress

Today I finished the Sprite Animation class using DirectX 11.1. I took a shortcut and used the DirectX Tool Kit. I was surprised to find that D3DX has been removed...
The Engine is located here: https://code.google.com/p/directxgameengine/

Wednesday, August 20, 2014

New Updates

I have started the embedded systems series by discussing video game console design. Fortunately, after a year or so, I now own a kit that has an ARM Cortex-M3 connected to a graphical LCD.

I have compiled about 4 videos in Arabic explaining the basics of Software Rendering here: 

https://www.youtube.com/playlist?list=PLHgpSHDk3q4D8UiZfNDUMVwIAEv9epi_X 

Here is the kit and the algorithms in action:

https://www.youtube.com/playlist?list=PLHgpSHDk3q4DIUuYhx_rfOaHUhzQiMbUo


I started working on the hardware rendering by writing a small 3D/2D Game Engine using DirectX 11.1 
https://code.google.com/p/directxgameengine/

My plan is to teach a course about it in Arabic. 

Tuesday, March 19, 2013

Using ARM Neon Intrinsics for Image Processing

SA,
I have been working on the ARM platform these days, and I tried to use ARM NEON intrinsics to optimize image processing functions that are used in OpenCV. Mastering SIMD is also useful for other applications, such as embedded software with critical timing requirements and real-time rendering in games.
ARM intrinsics and SIMD are a good place to start optimizing. The idea is to load and process several data elements with a single instruction.

Most mobile phones these days use ARM-based processors, so knowing the ARM architecture is essential for writing games, augmented reality or any other real-time application on iPhone or Android, for example.

A simple example that I worked on is subtracting two images with a large resolution, for example 1024*1024 with 3 channels. This operation takes a lot of cycles with normal non-SIMD instructions. The normal way is to iterate over width*height and subtract each RGB pixel of the dest from the corresponding pixel of the source. NEON comes to the rescue by loading multiple pixels with a single instruction, which saves a lot of CPU cycles and hence gives better performance. A simple piece of code looks like the following:
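        // load 8 RGB pixels (24 bytes) from each image, split into R, G and B lanes
        uint8x8x3_t srcPixels = vld3_u8(src);
        uint8x8x3_t dstPixels = vld3_u8(dst);
        // subtract them, one channel at a time
        uint8x8x3_t subPixels;
        subPixels.val[0] = vsub_u8(srcPixels.val[0], dstPixels.val[0]);
        subPixels.val[1] = vsub_u8(srcPixels.val[1], dstPixels.val[1]);
        subPixels.val[2] = vsub_u8(srcPixels.val[2], dstPixels.val[2]);
        // store the result, interleaved back into RGB order
        vst3_u8(result, subPixels);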
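
To show how the snippet above might sit inside a complete loop, here is a minimal sketch. The function name subtract_rgb and its parameters are assumptions made for illustration, not code from the test described in this post. It uses the NEON path when the compiler targets NEON and falls back to a plain scalar loop otherwise, which also handles the pixels left over when the count is not a multiple of 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Subtract two interleaved RGB images: result = src - dst, byte by byte.
 * Like vsub_u8, the subtraction wraps around modulo 256 on underflow.
 * pixel_count is the number of RGB pixels (3 bytes each). */
void subtract_rgb(const uint8_t *src, const uint8_t *dst,
                  uint8_t *result, size_t pixel_count)
{
    assert(src != NULL && dst != NULL && result != NULL);
    size_t i = 0;

#if HAVE_NEON
    /* NEON path: 8 pixels (24 bytes) per iteration. */
    for (; i + 8 <= pixel_count; i += 8) {
        uint8x8x3_t a = vld3_u8(src + 3 * i);
        uint8x8x3_t b = vld3_u8(dst + 3 * i);
        uint8x8x3_t d;
        d.val[0] = vsub_u8(a.val[0], b.val[0]);
        d.val[1] = vsub_u8(a.val[1], b.val[1]);
        d.val[2] = vsub_u8(a.val[2], b.val[2]);
        vst3_u8(result + 3 * i, d);
    }
#endif

    /* Scalar path: the leftover pixels, or everything when NEON is absent. */
    for (; i < pixel_count; i++) {
        for (size_t c = 0; c < 3; c++) {
            result[3 * i + c] = (uint8_t)(src[3 * i + c] - dst[3 * i + c]);
        }
    }
}
```

For a 1024*1024 image, pixel_count would be 1024 * 1024. If wrap-around is not wanted, vqsub_u8 is the saturating variant of vsub_u8 and clamps the result at 0 instead.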

Monday, March 11, 2013

Using ARM Neon to optimize square root function

SA,

I was working on a platform that has an ARM Cortex-A9 CPU (Samsung Exynos 4412), and had some computer vision code that makes a few sqrtf calls, and it was really slow :-).

Calculating the square root is helpful in many cases, especially in game development, for example to calculate the distance between points. Everyone who has worked on game development, especially in the era of software rendering but even on modern CPUs, knows that sqrt from cmath is really heavy on the processor. The same goes for sin and cos; one of the tricks is to use a lookup table that holds the precomputed values of most angles, and then retrieve the result from the table when you need it.

The Quake 3 source code from id Software, where John Carmack was the lead programmer of games such as Quake and Doom, contains a very fast inverse square root function that uses a Newton-Raphson approximation. It is often attributed to Carmack, and it really is pretty fast. I tried that function and it was fast, but not as fast as using the ARM NEON intrinsics.
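
For reference, the Quake 3 routine computes 1/sqrt(x), and the square root then follows as x * (1/sqrt(x)). The sketch below is a rewrite of that well-known trick, not the original source: the names fast_rsqrt and fast_sqrt are made up here, and memcpy replaces the pointer cast of the original to stay within standard C. It assumes 32-bit IEEE-754 floats and x > 0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) for x > 0 using the bit trick known from the
 * Quake 3 source, followed by one Newton-Raphson refinement step. */
float fast_rsqrt(float x)
{
    assert(x > 0.0f);
    float half = 0.5f * x;
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);     /* reinterpret the float as an integer */
    bits = 0x5f3759dfu - (bits >> 1);   /* magic constant gives a first guess */
    float y;
    memcpy(&y, &bits, sizeof y);
    y = y * (1.5f - half * y * y);      /* one Newton-Raphson iteration */
    return y;
}

/* Approximate sqrt(x) for x > 0 as x * (1 / sqrt(x)). */
float fast_sqrt(float x)
{
    return x * fast_rsqrt(x);
}
```

With a single iteration the result is only accurate to a fraction of a percent, which is usually fine for distances in a game but may not be enough for vision code.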

I wrote a small test application that compares sqrt from cmath with a version that uses ARM NEON, and this was the result: the NEON intrinsics version was two times faster than the normal square root.



I also used OpenMP's omp_get_wtime to measure the execution time of the two functions.

Friday, October 26, 2012

Multiprogramming in Embedded Software

SA,..
I know this is a pretty long topic, and it is discussed heavily in the context of multi-threaded desktop applications, but I have found few books that tackle the subject for embedded software. I will also review some basic operating system concepts that can be found in any real-time operating system book.

Embedded Systems Architecture

The minimum software architecture for a simple embedded system, or even for Fifa2012, is the following code.

For games, the loop ends when the user quits the game:
while (!exit)
{
    render_graphics();
    updateAI();
    updatePhysics();
}
The same goes for an embedded system, except that the system exits or shuts down when the user switches off the power:
while (1)
{
    readAdc();
    calculateWeight();
    sendtoPC();
}
This is also known as an endless loop, and in games it's called the game loop.
For embedded systems, this basic architecture is simple and efficient (no need for timers or other uC resources). On the other hand, if your application needs to read ADC data at a precise interval, for example every 2ms, this architecture won't give you the flexibility or the accuracy for that task.
Another issue is that the uC is busy all the time and runs at full power, which has a dramatic impact on power consumption, especially if your system runs on batteries...

Basically, you need a scheduler to solve the previous issues.    
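
As a rough illustration of what such a scheduler could look like, here is a minimal time-triggered, cooperative sketch. All the names in it (task_t, scheduler_add, scheduler_tick and the two example tasks) are hypothetical, and it assumes that a hardware timer produces a regular tick, for example every 1ms, with scheduler_tick called once per tick.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*task_fn)(void);

typedef struct {
    task_fn  run;        /* function to call when the task is due */
    uint32_t period;     /* period in ticks (must be > 0) */
    uint32_t countdown;  /* ticks left until the next run */
} task_t;

#define MAX_TASKS 8
static task_t tasks[MAX_TASKS];
static size_t task_count = 0;

/* Register a task. first_delay is the tick of its first run.
 * Returns 0 on success, -1 if the task table is full. */
int scheduler_add(task_fn run, uint32_t period, uint32_t first_delay)
{
    assert(run != NULL && period > 0);
    if (task_count >= MAX_TASKS) {
        return -1;
    }
    tasks[task_count].run = run;
    tasks[task_count].period = period;
    tasks[task_count].countdown = first_delay;
    task_count++;
    return 0;
}

/* Call once per timer tick: runs every task whose countdown has expired. */
void scheduler_tick(void)
{
    for (size_t i = 0; i < task_count; i++) {
        if (tasks[i].countdown == 0) {
            tasks[i].run();
            tasks[i].countdown = tasks[i].period;
        }
        tasks[i].countdown--;
    }
}

/* Hypothetical example tasks; real ones would touch the hardware. */
static unsigned adc_reads = 0;
static unsigned pc_sends = 0;
static void read_adc_task(void)   { adc_reads++; }
static void send_to_pc_task(void) { pc_sends++; }
```

With a 1ms tick, scheduler_add(read_adc_task, 2, 0) would read the ADC every 2ms. Between ticks the main loop has nothing to do, so the uC can be put to sleep until the next timer interrupt instead of spinning at full power.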

Basic  OS Concepts

I remember when I got my hands on Windows 98 and finally managed to play on it. I was surprised when the computer tech guy explained to me how nice Windows 98 was: it could play Fifa98 while you listened to Winamp at the same time. It was an interesting feature of the Windows series, as the previous operating system, Windows 3.11, didn't support preemptive multitasking.

Scheduling is the method that decides which process gets the CPU time, or its working power. By letting the CPU switch among processes, like a game and Windows Media Player, the computer becomes more productive. A computer that has only a single CPU can run only one process at a time, just as a weighing scale can only read the weight from the load cell, etc. The idea of multitasking or multi-threaded applications is to have some process running at all times, in order to utilize the CPU to the maximum. The basic way to achieve that is to execute a process until it has to wait for the completion of an I/O request. In a simple system like a small embedded system, the CPU just sits idle, and all the time spent waiting for the I/O completion is wasted. In most OSs, several processes are kept in memory at one time; when a process is in a wait state, the OS lets the CPU switch to another process. That selection is the job of the CPU scheduler.

There are two different types of CPU scheduling:
1. Non-preemptive scheduling: once the CPU has been allocated to a process, the process keeps the CPU until it releases it, either by terminating (ESC) or by switching to the waiting state (ALT-TAB ;) ). This scheduling method was used by Windows 3.1. It provides a cooperative multitasking architecture: a process that never releases the CPU blocks all the others.

2. Preemptive scheduling: scheduling is prioritized. The highest-priority process that is ready to run should always be the one currently using the CPU, so the scheduler may interrupt a running process. Windows 95, 98, etc. work with this technique. It provides a multitasking system architecture.

There are different scheduling algorithms, like round-robin; you can read more about them in any OS book.
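
To make the difference between priority-based selection and round-robin concrete, here is a small sketch of the decision a scheduler makes each time it runs. The process_t structure and both function names are assumptions for illustration; a real OS keeps much more state per process.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int priority;  /* larger value = more important (an assumption) */
    int ready;     /* non-zero when the process is able to run */
} process_t;

/* Priority scheduling: index of the highest-priority ready process,
 * or -1 when nothing is ready. */
int pick_highest_priority(const process_t *p, size_t n)
{
    assert(p != NULL || n == 0);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (p[i].ready && (best < 0 || p[i].priority > p[best].priority)) {
            best = (int)i;
        }
    }
    return best;
}

/* Round-robin: the first ready process after `current`, wrapping around,
 * or -1 when nothing is ready. Pass current = -1 to start from index 0. */
int pick_round_robin(const process_t *p, size_t n, int current)
{
    assert(p != NULL || n == 0);
    assert(current >= -1);
    for (size_t step = 1; step <= n; step++) {
        size_t i = (size_t)(current + (int)step) % n;
        if (p[i].ready) {
            return (int)i;
        }
    }
    return -1;
}
```

In a preemptive system the selection runs on every timer tick or whenever a process becomes ready; in a non-preemptive one it only runs when the current process gives up the CPU.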

Next post, isA, I will discuss critical sections and other issues in multitasking, and form a design pattern for them.

Wednesday, October 24, 2012

Design Patterns for Embedded Software (1)



SA..

Introduction

Design patterns are a group of reusable solutions for problems that appear in software design. Those solutions commonly appear when writing high-level object-oriented software for the desktop. A great book that explains design patterns in detail, usually called the Gang of Four book, is Design Patterns: Elements of Reusable Object-Oriented Software.

It is very rare to work with object-oriented programming while writing firmware, although there are C++ compilers available, like the IAR C/C++ Compiler. I also searched for resources on common design patterns for embedded software and found only one book on the subject. Its author tried to mix in Gang of Four patterns like observer, strategy, etc. and give them a firmware flavor. The book is Design Patterns for Embedded Systems in C: An Embedded Software Engineering Toolkit. The book is really bad and I didn't like it at all; using UML for C code is awful.

1. Polled Input Pattern

The first pattern that I would like to discuss is the Polled Input. Sometimes, when you read an input from a switch for example, you poll the switch in a loop like this: while(switchInput!=0); This means the microcontroller gets stuck here waiting for the switch to go to ground. This way of reading an input is okay for systems that are not real-time or time-triggered. For those, you might instead use an ISR (Interrupt Service Routine) triggered by the falling or rising edge of the sensor, but as you know the CPU then has to stop its current task and complete the ISR first. This kind of context switching is an overhead, and raises serious complications in embedded systems.

The pattern that solves this should have the following properties:
1. In a real-time operating system, or any scheduler, a periodic task should poll the input for the occurrence of the event.
2. The period of the task, Ttask, should be <= min Tevent, so that no event is missed.
Suppose you would like to poll a push button, as shown in the following figure, with a 10k pull-up resistor to VCC. A common problem with push buttons or any switch is the spikes caused by mechanical contact bounce, which you need to filter out ("debounce"). Of course you can filter the noise with electronics, which is called hardware filtering, but the easier way is to filter those spikes in software.
The following figure shows the spikes from switching a switch on and off.

A simple piece of code that shows the idea of the pattern:



Note that the previous update function should be run by a task scheduler or a timer, for example every 50ms to 500ms. Note that it also solves the debouncing issue with switches.

Saturday, October 6, 2012

In the beginning there was a pixel..

Paraphrasing the Bible (John 1:1): it all started with plotting a pixel on the screen. If you have managed to draw a pixel on the screen, then you can write Fifa2012 on your microcontroller, on a high-spec one of course :).
Once you can draw a pixel on the screen, you can use Bresenham's line algorithm to draw a line, then you can draw a triangle, then a polygon that is broken down into many triangles, and then texture, shade and light it. A great resource for all these algorithms, outdated but still useful, is the book Computer Graphics: Principles and Practice by Foley and van Dam.
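
As an illustration of the first step, from a pixel to a line, here is a sketch of the integer-only form of Bresenham's line algorithm. The framebuffer_t type and the function names are assumptions for this example; on a real kit plot_pixel would write to the LCD or to video memory instead of a byte array.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *pixels;  /* width * height bytes, one byte per pixel */
    int width;
    int height;
} framebuffer_t;

/* Plot one pixel; coordinates outside the framebuffer are ignored. */
static void plot_pixel(framebuffer_t *fb, int x, int y, uint8_t color)
{
    assert(fb != NULL && fb->pixels != NULL);
    if (x >= 0 && x < fb->width && y >= 0 && y < fb->height) {
        fb->pixels[y * fb->width + x] = color;
    }
}

/* Bresenham's line algorithm, valid for all directions and slopes.
 * It uses only integer additions and comparisons. */
void draw_line(framebuffer_t *fb, int x0, int y0, int x1, int y1,
               uint8_t color)
{
    int dx = abs(x1 - x0);
    int dy = -abs(y1 - y0);
    int sx = (x0 < x1) ? 1 : -1;
    int sy = (y0 < y1) ? 1 : -1;
    int err = dx + dy;  /* running error term */

    for (;;) {
        plot_pixel(fb, x0, y0, color);
        if (x0 == x1 && y0 == y1) {
            break;
        }
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }  /* step in x */
        if (e2 <= dx) { err += dx; y0 += sy; }  /* step in y */
    }
}
```

The absence of multiplications, divisions and floating point in the loop is what makes the algorithm attractive on small microcontrollers. A wireframe triangle is then just three draw_line calls.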

One of the first game consoles that tried to do 3D software rasterization is the 3DO. It has a 32-bit ARM6 CPU clocked at 12.5MHz!! Embedded guys, I'm sure you will be surprised by how low the specs were and how beautiful the games made for that console were, and now we have smaller, lower-power, faster microcontrollers like the PIC32. (3DO Specs)
Anyway,
how can we draw a frame of an NTSC signal?

The pseudo-code for the algorithm is simple:

1. Draw 8 scan lines  // top over scan
2. Draw 240 scan lines // the active video
3. Draw 10 scan lines // bottom over scan
4. Generate VSync pulse // send the sync pulse to redraw
5. Delay 6 scan lines 
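
The five steps above can be sketched in C as follows. Everything hardware-specific is hidden behind function pointers, and the names video_ops_t, draw_frame and the counting stubs are all hypothetical. The sketch also assumes that the final delay is spent emitting blank lines; real code has to generate each line with exact timing, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>

enum {
    TOP_OVERSCAN_LINES    = 8,
    ACTIVE_LINES          = 240,
    BOTTOM_OVERSCAN_LINES = 10,
    VSYNC_DELAY_LINES     = 6
};

/* Hypothetical hardware hooks supplied by the platform code. */
typedef struct {
    void (*blank_line)(void);        /* one scan line with no picture */
    void (*active_line)(int line);   /* one scan line of visible video */
    void (*vsync_pulse)(void);       /* vertical sync pulse */
} video_ops_t;

/* Emit one frame in the order given by the pseudo-code. */
void draw_frame(const video_ops_t *ops)
{
    assert(ops != NULL && ops->blank_line != NULL &&
           ops->active_line != NULL && ops->vsync_pulse != NULL);

    for (int i = 0; i < TOP_OVERSCAN_LINES; i++)    ops->blank_line();
    for (int i = 0; i < ACTIVE_LINES; i++)          ops->active_line(i);
    for (int i = 0; i < BOTTOM_OVERSCAN_LINES; i++) ops->blank_line();
    ops->vsync_pulse();
    for (int i = 0; i < VSYNC_DELAY_LINES; i++)     ops->blank_line();
}

/* Counting stubs that stand in for real signal generation. */
static int blank_count = 0;
static int active_count = 0;
static int vsync_count = 0;
static void count_blank(void)      { blank_count++; }
static void count_active(int line) { (void)line; active_count++; }
static void count_vsync(void)      { vsync_count++; }
```

On a microcontroller, draw_frame would be called in an endless loop, or driven line by line from a timer interrupt so that the rest of the program can run during the blank lines.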

A nice Atmel application note on generating NTSC signals with an AVR can be found here:
http://www.atmel.com/Images/mega163_3_04.pdf