Performance research

After I managed to get the Android emulator to make some noise, I decided it was a good time to experiment and find out the best way to get better sound performance on Android.

In my last post I expressed some doubts about my own code: was I optimizing too early? So I tried several things to prove myself wrong (or right). It turned out I was right to optimize from the beginning:

  1. First I replaced all the calls to my humble sin Look Up Table (LUT) with calls to the real, meaty Math.sin function. It works with the double data type, which means it would require a decent Floating Point Unit (FPU) to deliver all the data in time. Consequently, it all ended with stuttering sound and recurrent gaps in the generated wave, because the thread couldn't fill the audio buffer in time.
  2. Then I tried using Toxi's Trigonometric LUT functions and fast sin function, if only to compare them with my rudimentary LUT techniques. Unfortunately, although they did well with my first and only 64k intro, they still use float, so I was still getting gaps in the audio.

So it seemed that my approach was heading in the right direction. What I mean by my approach is that when I precompute the values of the lookup tables, I am calculating shorts already. Why? Because the final output (i.e. the audio buffer) uses shorts as well. It is quite a change for me, because in my previous attempts at sound synthesis I was using floats internally all the way, and only converted to integers in the last phase, when filling in the buffer that would be returned to the callback function.
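As a rough illustration of that idea (not the actual code from this project), here is a minimal Java sketch of a sine table that is stored as shorts from the start. The class name `ShortSineTable`, the `build` method and the table size are all made up for the example.

```java
// Hypothetical helper: a sine lookup table stored directly as 16-bit samples.
final class ShortSineTable {

    /** Builds one full sine cycle as shorts in the [-32767, 32767] range. */
    static short[] build(int size) {
        short[] table = new short[size];
        for (int i = 0; i < size; i++) {
            // Floating point is only used here, once, while precomputing.
            double angle = 2.0 * Math.PI * i / size;
            table[i] = (short) Math.round(Math.sin(angle) * Short.MAX_VALUE);
        }
        return table;
    }
}
```

With a table like this, the audio loop can copy values into the short buffer as they are, with no float-to-integer conversion per sample.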

Changing from floats to integers/shorts means I have to change the usual mental scale as well. You can no longer think of the oscillator values as being in the [-1, 1] range; they are in the [-32767, 32767] range instead, with no intermediate (fractional) values. That effectively reduces precision, but I think it's fine to compromise a bit on the audio quality in this case: it's not like your phone is connected to high end speakers.
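To make the change of scale concrete, here is a small hedged sketch in Java. `SampleScale`, `fromFloat` and `mix` are hypothetical names for illustration only; the point is that the limits you clamp against become -32767 and 32767 instead of -1 and 1.

```java
// Hypothetical helper showing the integer sample scale.
final class SampleScale {
    static final int MAX = 32767; // same value as Short.MAX_VALUE

    /** Maps a float sample in [-1, 1] to the integer scale, dropping the fraction. */
    static short fromFloat(float x) {
        return (short) (x * MAX);
    }

    /** Adds two samples in an int so the sum cannot wrap, then clamps it. */
    static short mix(short a, short b) {
        int sum = a + b;
        if (sum > MAX) sum = MAX;
        if (sum < -MAX) sum = -MAX;
        return (short) sum;
    }
}
```

Note how `fromFloat(0.5f)` has to throw away the half step between two neighbouring integers: that is the precision loss mentioned above.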

Just to clarify: I am not using fixed point math, and I am unsure whether that's a good idea. For starters, I have never written any fixed point routine, let alone used one (that I am aware of). There are apparently libraries you can use, but their performance does not seem to be exactly stellar on devices such as this phone.

So what I am doing is using integers and shorts as much as possible, with the occasional use of a float here and there. I am also trying to simplify the work done inside loops, so for example I extract some multiplications and try to convert things to sums only. Hopefully this approach will allow me to have some extra precious spare CPU cycles for graphics in the application :-P
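Here is a sketch of the kind of loop simplification I mean, under some simplifying assumptions: the oscillator steps through the table by a whole number of entries per sample, and `0 <= step < table.length`. The class and method names are invented for the example.

```java
// Hypothetical example: replacing a per-sample multiply with a running sum.
final class SumsInsteadOfProducts {

    /** Straightforward version: one multiply and one modulo per sample. */
    static void fillWithMultiply(short[] out, short[] table, int step) {
        for (int i = 0; i < out.length; i++) {
            out[i] = table[(i * step) % table.length];
        }
    }

    /** Same output, but the loop body only adds, compares and subtracts. */
    static void fillWithAdds(short[] out, short[] table, int step) {
        int index = 0;
        for (int i = 0; i < out.length; i++) {
            out[i] = table[index];
            index += step;
            // Wrap around without a modulo; valid while 0 <= step < table.length.
            if (index >= table.length) index -= table.length;
        }
    }
}
```

Both methods should fill the buffer with the same values; the second one just keeps the multiplication (and the modulo) out of the loop.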

And what do Android developers say?

There's a veeery long thread discussing these points in the Android developers group. This is what a couple of Android developers say:

Don't count on floating point hardware being the common case. The mobile space is very different than what you may be used to in the desktop world, in that there are two factors that drive mobile device design just as much as performance does in the desktop world: battery life and cost. Including floating point hardware support has a negative impact on both of these, so if the performance gain isn't going to strongly help sell the phone then there is very little incentive to have it.

Another good thing to be aware of about the mobile space is the range of hardware is much broader than desktops. That is, the difference in performance between a low-end and high-end mobile device is much larger than corresponding difference in desktop systems. In addition, most of the phones in use are down in the low-end spectrum, so if you want to have any broad use of your software then you need to think about how it will work on those low-end devices.

(by Dianne Hackborn)

and this one at the end:

A single interpreted dalvik instruction, be it a "NOP", costs about the same than a softfp multiply. A float-to-int or int-to-float operation costs about 15 CPU cycles. At the lowest level, a softfp add or mul costs about 20 CPU cycles. At the lowest level, a integer add costs 1 CPU cycle, while a multiply costs between 2 and 6 cycles. An integer divide can cost up to 100+ CPU cycles, be it integer or softfp. Most ARM CPUs don't even have an integer divide instruction (it's done by hand, just like in elementary school, but in binary! :)).

So, to conclude:

  • avoid divides/modulos at all cost float or integer

  • floating point operations in interpreted java language/dalvik are only roughly half the speed of the equivalent integer operation, so it's not /that/ slow.

  • try to avoid using floats too much, but don't sweat it, simply try to take as much computation as possible out of tight loops.

  • use integer whenever it makes sense and as much as possible

  • in the case of a JITted VM or native code, avoid floats like the plague, since they are an order of magnitude slower than their integer counterparts.

by Mathias Agopian.
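One common way to follow the "avoid divides/modulos" advice in a lookup-table oscillator is to make the table size a power of two, so wrapping becomes a bitwise AND and halving becomes a shift. This is only a sketch of that general trick, not something taken from the thread; `NoDivide` and its constants are hypothetical.

```java
// Hypothetical example: power-of-two sizes let AND and shift replace % and /.
final class NoDivide {
    static final int SIZE_BITS = 10;          // assumed: the table has 2^10 entries
    static final int SIZE = 1 << SIZE_BITS;   // 1024
    static final int MASK = SIZE - 1;

    /** Same result as Math.floorMod(index, SIZE), using a single AND. */
    static int wrap(int index) {
        return index & MASK;
    }

    /**
     * Divides by 2^shift with an arithmetic shift. Unlike the / operator,
     * this rounds negative values towards negative infinity.
     */
    static int attenuate(int sample, int shift) {
        return sample >> shift;
    }
}
```

The catch is the rounding of negative samples noted in the comment, which may or may not matter for audio at this quality level.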

So it looks like I am on the right track. Winden also confirmed by email that the best thing is to avoid floats on ARM-powered devices, and I believe that if so many people point to this path, it must be the right one, mustn't it? :D

With a bit of luck, the next post should be about my first impressions of OpenGL ES on Android.