Wednesday, November 24, 2010

Improving Codec2 Decode Speeds with Alternate FFT Algorithms

Codec2 operates nicely in real-time on a modern-class laptop, but I've been trying to push it into other devices. Real-time encoding and decoding on PIC32s will need to wait for a 16 bit integer version (which can make use of Microchip's DSP library), but how about ARM hardware? I ported the code to a Nokia N800 (400MHz ARM11), and it works nicely, but the decoding is choppy after a while because it just can't spit out audio in real time. Decoding will need about a 2x speed up. I've been able to get pretty close to that by swapping out Dave's Fast Fourier Transform function with a different one licensed under the GPL.

 Codec2, being a sound codec, makes frequent use of FFTs. Here's the result of gprof running over c2dec:
%   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
 83.52      3.65     3.65    28116   129.82   129.82  four1a
  7.78      3.99     0.34    11246    30.23   160.05  aks_to_H
  3.43      4.14     0.15     5623    26.68   156.50  aks_to_M2
  2.75      4.26     0.12   599720     0.20     0.20  sample_log_amp
  1.37      4.32     0.06    11246     5.34   135.15  synthesise
  ...
You can see that four1a, the FFT routine, is taking up 3.65 seconds, and the total turns out to be 4.37! So the FFT is a great candidate for improvement, and happily the current code uses a very slow FFT, the one from Numerical Recipes in C.

With a bit of work, I swapped in the kiss FFT code. This is by no means top-notch in the speed derby, but it should be better. It has the advantage of being small and simple. Here's the same process, on the same machine using that code:
%   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
 49.43      1.30     1.30    28116    46.24    46.24  kf_work
 15.97      1.72     0.42    11246    37.35    92.12  aks_to_H
  9.13      1.96     0.24    28116     8.54    54.77  four1
  6.46      2.13     0.17     5623    30.23    85.01  aks_to_M2
  6.08      2.29     0.16    11246    14.23    69.00  synthesise
...
Total time turns out to be 2.63 sec. We have to be careful to charge both kf_work and four1 against FFT, because the 'four1' routine is my bridging code.  Still, this is quite the improvement.

On decode, anyway, FFT remains the big time-hog. We can improve it further, by about another 50%, if we treat all transforms as entailing real numbers. kiss FFT has a real number routine on board, but I'm having trouble interfacing this with Dave's existing code. Something's just not aligning properly.

Finally, the gold standard in  C code FFT speed appears to be the FFTW library. Implementing this should improve matters greatly again, although its interface seems a bit hairy. After that, it's hardware DSP code. The N800, for example, actually has a DSP engine which could be engaged to this purpose quite well, but I can't find if or where a FFT routine is exposed usefully. In any case, the N800 is a pretty old machine: what barely works on it should do just fine on more recent hardware.

No comments:

Post a Comment