Arduino execution time analysis
Recently I was asked to take a look at some code for the Arduino Yun that was developed as a proof of concept for a medical device. The company developing the device was “running out of room” and couldn’t get the Arduino to sample all the ADC inputs in the time period they wanted. I suspect they mean they are running out of processing time to sample/control multiple outputs within their processing window. I have always worked a level down from the Arduino wiring framework and honestly have not worked with the Arduino family much, but when a contract job comes up, you take it. The first thing I noticed in the code was heavy use of floating point values as parameters and returns for lots of functions. Anyone who is embedded knows that floating point operations take WAY more time than integer math, but I was curious as to how much longer. I didn’t find any good online resources that say “floating point divides take XXX instructions”, so I decided to get that information myself.
While I don’t have a Yun, I do have an Arduino Uno which uses the ATMega328, an 8-bit micro with most instructions running at 1 instruction per clock using a 16MHz crystal. I decided to look at each basic math operation for unsigned integer types and floating point types. All input and output variables were created as volatile so the compiler wouldn’t optimize, each operation was performed 1000 times and the results are the average. The method of recording the times was the micros() function which has an accuracy of 4us and does include an unsigned long shift, adding a uint8_t and a multiply by a uint16 literal value. Here are the results on the Uno:
Starting test, looping 1000 times
Control 10ms loop: time 10009 us
Float div: 34 us ~544 instructions, 29/ms
Float mul: 12 us ~192 instructions, 83/ms
Float add: 11 us ~176 instructions, 90/ms
Float sub: 11 us ~176 instructions, 90/ms
uint8 div: 8 us ~128 instructions, 125/ms
uint8 mul: 3 us ~48 instructions, 333/ms
uint8 add: 3 us ~48 instructions, 333/ms
uint8 sub: 3 us ~48 instructions, 333/ms
uint16 div: 16 us ~256 instructions, 62/ms
uint16 mul: 4 us ~64 instructions, 250/ms
uint16 add: 3 us ~48 instructions, 333/ms
uint16 sub: 3 us ~48 instructions, 333/ms
uint32 div: 41 us, ~656 instructions, 24/ms
uint32 mul: 9 us, ~144 instructions, 111/ms
uint32 add: 4 us, ~64 instructions, 250/ms
uint32 sub: 4 us, ~64 instructions, 250/ms
What is interesting is that a uint32_t divide takes more time than a floating point divide! Overall it is clear to see that as the integer gets larger the processing time increases. Floating point operations are 3-4 times as long as integer operations (except for uint32_t divides). I estimated the number of instructions per operation based on 16MHz and how many of each operation could be performed in 1ms.
The next step is to get these values for the Yun and look for performance improvement areas.
Here is the code I used to get these values: http://protological.com/browser/files/timer_sketch.ino.
(Here is the same code but with a macro function, it’s a little cleaner to look at)