Recently I was asked to take a look at some code for the Arduino Yun that was developed as a proof of concept for a medical device. The company developing the device was “running out of room” and couldn’t get the Arduino to sample all the ADC inputs in the time period they wanted. I suspect they meant they were running out of processing time to sample/control multiple outputs within their processing window. I have always worked a level down from the Arduino Wiring framework and honestly have not worked with the Arduino family much, but when a contract job comes up, you take it. The first thing I noticed in the code was heavy use of floating point values as parameters and return values for lots of functions. Anyone who works in embedded knows that floating point operations take WAY more time than integer math, but I was curious as to how much longer. I didn’t find any good online resources that say “floating point divides take XXX instructions”, so I decided to get that information myself.
While I don’t have a Yun, I do have an Arduino Uno, which uses the ATmega328, an 8-bit micro that executes most instructions in one clock cycle and runs from a 16MHz crystal. I decided to look at each basic math operation for unsigned integer types and floating point types. All input and output variables were declared volatile so the compiler wouldn’t optimize the operations away. Each operation was performed 1000 times, and the results are the average. Times were recorded with the micros() function, which has a resolution of 4us and whose own overhead includes an unsigned long shift, adding a uint8_t and a multiply by a uint16 literal value. Here are the results on the Uno:
Starting test, looping 1000 times
Control 10ms loop: time 10009 us
Float div: 34 us ~544 instructions, 29/ms
Float mul: 12 us ~192 instructions, 83/ms
Float add: 11 us ~176 instructions, 90/ms
Float sub: 11 us ~176 instructions, 90/ms
uint8 div: 8 us ~128 instructions, 125/ms
uint8 mul: 3 us ~48 instructions, 333/ms
uint8 add: 3 us ~48 instructions, 333/ms
uint8 sub: 3 us ~48 instructions, 333/ms
uint16 div: 16 us ~256 instructions, 62/ms
uint16 mul: 4 us ~64 instructions, 250/ms
uint16 add: 3 us ~48 instructions, 333/ms
uint16 sub: 3 us ~48 instructions, 333/ms
uint32 div: 41 us, ~656 instructions, 24/ms
uint32 mul: 9 us, ~144 instructions, 111/ms
uint32 add: 4 us, ~64 instructions, 250/ms
uint32 sub: 4 us, ~64 instructions, 250/ms
What is interesting is that a uint32_t divide takes more time than a floating point divide! Overall it is clear that as the integer gets wider the processing time increases. Floating point operations take roughly 3-4 times as long as integer operations (except for uint32_t divides). The instruction counts are estimates: the measured time multiplied by 16 clock cycles per microsecond at 16MHz, assuming one instruction per clock. The per-ms figure is how many of each operation could be performed in 1ms.
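The arithmetic behind the two derived columns can be sketched as below. The function names are made up for illustration, and the truncating integer division is an assumption that happens to reproduce the figures in the table (for example 34 us gives 544 and 29/ms).

```cpp
#include <stdint.h>

// Clock cycles per microsecond on a 16 MHz part such as the Uno.
static const uint32_t kCyclesPerUs = 16;

// Hypothetical helper: estimated clock cycles for an operation that took `us` microseconds.
// This equals the instruction count only if every instruction takes one cycle.
uint32_t us_to_cycles(uint32_t us) {
    return us * kCyclesPerUs;
}

// Hypothetical helper: how many such operations fit in 1 ms (1000 us).
// Integer division truncates, which matches the table (1000 / 34 -> 29).
uint32_t ops_per_ms(uint32_t us) {
    if (us == 0) {
        return 0;  // too fast to measure; avoid dividing by zero
    }
    return 1000 / us;
}
```

For instance, `us_to_cycles(41)` gives 656 and `ops_per_ms(41)` gives 24, the uint32 divide row.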
The next step is to get these values for the Yun and look for performance improvement areas.
Here is the code I used to get these values: http://protological.com/browser/files/timer_sketch.ino.
(Here is the same code but with a macro function; it’s a little cleaner to look at.)
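For readers who don't follow the links, the measurement approach described above looks roughly like the sketch below. This is not the linked sketch itself: `now_us()` and `time_float_div()` are names invented here, the operand values are arbitrary, and the `#else` branch is only a host fallback so the code can be tried off-target.

```cpp
#include <stdint.h>

#ifdef ARDUINO
#include <Arduino.h>  // provides micros()
static uint32_t now_us() { return micros(); }
#else
#include <chrono>     // host fallback; timings here say nothing about an AVR
static uint32_t now_us() {
    using namespace std::chrono;
    return (uint32_t)duration_cast<microseconds>(
        steady_clock::now().time_since_epoch()).count();
}
#endif

// volatile forces a real load of each operand and a real store of the result,
// so the compiler cannot fold the division away or hoist it out of the loop.
static volatile float f_a = 355.0f;
static volatile float f_b = 113.0f;
static volatile float f_out;

// Hypothetical helper: average time in microseconds of one float divide over n runs.
// The timestamps are taken immediately before and after the operation, so the
// loop bookkeeping is outside the timed window, but the cost of reading the
// timer itself is still included in every sample.
uint32_t time_float_div(uint16_t n) {
    if (n == 0) {
        return 0;
    }
    uint32_t total = 0;
    for (uint16_t i = 0; i < n; ++i) {
        uint32_t start = now_us();
        f_out = f_a / f_b;
        uint32_t stop = now_us();
        total += stop - start;  // unsigned subtraction is safe across timer wrap
    }
    return total / n;
}
```

On the Uno this would be called as `time_float_div(1000)` from `setup()` and the result printed over serial. Because micros() only ticks in 4us steps there, operations shorter than a tick are dominated by the timer overhead.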
Are you sure that your testing rig isn’t dominating the results? My guess is that the looping is taking more time than the math.
uint8 add, for instance, should really only take one instruction and should definitely be faster than a uint8 multiply. (I’ll give or take a couple more instructions for queuing up the variables in the right registers, etc.)
An easy way to time these sort of things is to write your code so that it toggles an output pin before and after the operation you’re interested in, and then read the timings directly off of the pulse width on an oscilloscope.
A better way to do this timing comparo is to look at the disassembled code. The only downside is that most AVR instructions take one cycle, but some take two. So you’ll have to do some hasslesome accounting. But you’ll know _exactly_ how many cycles your code is taking.
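As an illustration of that accounting, here is a small sketch that adds up cycles for a volatile uint8 add. The instruction sequence is an assumption about what the compiler emits for two volatile loads, an add and a volatile store; check the real disassembly before trusting it. The cycle counts are the ATmega328 figures for LDS (2), ADD (1) and STS (2).

```cpp
#include <stddef.h>
#include <stdint.h>

// One disassembled instruction and its cycle cost.
struct Instr {
    const char* mnemonic;
    uint8_t cycles;
};

// Assumed sequence for `out = a + b` with volatile uint8_t globals.
static const Instr kUint8Add[] = {
    {"lds", 2},  // load a from SRAM
    {"lds", 2},  // load b from SRAM
    {"add", 1},  // register add
    {"sts", 2},  // store result to SRAM
};

// Sum the cycle cost of a sequence of instructions.
uint16_t total_cycles(const Instr* seq, size_t n) {
    uint16_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += seq[i].cycles;
    }
    return sum;
}

// Convert a cycle count to microseconds at the given CPU frequency in Hz.
double cycles_to_us(uint16_t cycles, double f_cpu_hz) {
    return cycles * 1e6 / f_cpu_hz;
}
```

Under these assumptions the add costs 7 cycles, well under half a microsecond at 16MHz, which is far below the 4us step of micros().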
Finally, the GCC compiler that you’re probably using (and the AVR math libraries) are pretty smart about taking numerical shortcuts when possible. For example, multiplies and divides by powers of two are done very quickly by bit-shifting the operand, while multiplies by non-powers-of-two take a lot longer.
In short, here be dragons.
Best,
Elliot.
Elliot,
Thanks for the feedback. I tried to reduce the amount of extra processing for each operation by pulling the result of micros() into variables just before and just after the operation, then doing the processing and looping. I expect there is a good amount of baseline error, which would affect the smaller operations most. The main focus was to quantify the magnitude of processing for float on an 8-bit system so I can explain to the client that having every argument as float isn’t helping.
To be honest I did start this little experiment by trying to use timer 1 as a stopwatch, but after some fiddling I decided to use the quicker and dirtier micros() function.
I totally agree, looking at the disassembly would be the best and that’s what I tend to do on other systems (PIC, 8051s, etc) but as my first foray into the Arduino buildchain I didn’t dig that deep.
I suspect the reason the uint32 divide takes longer than the floating divide is that the mantissa that has to be divided is only 24 bits for the floating divide but 32 bits for the uint32 divide. So your observation that more digits take more time is correct and the result is quite logical.
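To make the bit-width argument concrete, here is a minimal sketch of shift-and-subtract (restoring) division, where the loop runs once per bit of the dividend. It is an illustration only: `shift_sub_div` is a name invented here, and the actual AVR library routines may be organized differently. A real float divide also has to handle sign, exponent and normalization on top of the mantissa loop.

```cpp
#include <stdint.h>

// Restoring division of the low `bits` bits of `num` by `den`.
// Returns the quotient and reports the number of loop iterations via `steps`.
// A zero divisor is not treated specially and yields an all-ones quotient.
uint32_t shift_sub_div(uint32_t num, uint32_t den, uint8_t bits, uint8_t* steps) {
    uint32_t quot = 0;
    uint64_t rem = 0;  // wider than the operands so the shift cannot overflow
    *steps = 0;
    for (int8_t i = (int8_t)(bits - 1); i >= 0; --i) {
        rem = (rem << 1) | ((num >> i) & 1u);  // bring down the next dividend bit
        quot <<= 1;
        if (rem >= den) {                      // divisor fits: subtract, set quotient bit
            rem -= den;
            quot |= 1u;
        }
        ++*steps;
    }
    return quot;
}
```

Dividing 1000 by 7 gives 142 either way, but takes 32 iterations when treated as a 32-bit value and 24 when treated as a 24-bit one, which is the shape of the difference between the uint32 and float divide timings.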