Improving Codec2 Decode Speeds with Alternate FFT Algorithms
Codec2 operates nicely in real-time on a modern-class laptop, but I've been trying to push it into other devices. Real-time encoding and decoding on PIC32s will need to wait for a 16 bit integer version (which can make use of Microchip's DSP library), but how about ARM hardware? I ported the code to a Nokia N800 (400MHz ARM11), and it works nicely, but the decoding is choppy after a while because it just can't spit out audio in real time. Decoding will need about a 2x speed up. I've been able to get pretty close to that by swapping out Dave's Fast Fourier Transform function with a different one licensed under the GPL.
Codec2, being a sound codec, makes frequent use of FFTs. Here's the result of gprof running over c2dec:
With a bit of work, I swapped in the kiss FFT code. This is by no means top-notch in the speed derby, but it should be better. It has the advantage of being small and simple. Here's the same process, on the same machine using that code:
On decode, anyway, FFT remains the big time-hog. We can improve it further, by about another 50%, if we treat all transforms as entailing real numbers. kiss FFT has a real number routine on board, but I'm having trouble interfacing this with Dave's existing code. Something's just not aligning properly.
Finally, the gold standard in C code FFT speed appears to be the FFTW library. Implementing this should improve matters greatly again, although its interface seems a bit hairy. After that, it's hardware DSP code. The N800, for example, actually has a DSP engine which could be engaged to this purpose quite well, but I can't find if or where a FFT routine is exposed usefully. In any case, the N800 is a pretty old machine: what barely works on it should do just fine on more recent hardware.
Codec2, being a sound codec, makes frequent use of FFTs. Here's the result of gprof running over c2dec:
% cumulative self self total time seconds seconds calls us/call us/call name 83.52 3.65 3.65 28116 129.82 129.82 four1a 7.78 3.99 0.34 11246 30.23 160.05 aks_to_H 3.43 4.14 0.15 5623 26.68 156.50 aks_to_M2 2.75 4.26 0.12 599720 0.20 0.20 sample_log_amp 1.37 4.32 0.06 11246 5.34 135.15 synthesise ...You can see that four1a, the FFT routine, is taking up 3.65 seconds, and the total turns out to be 4.37! So the FFT is a great candidate for improvement, and happily the current code uses a very slow FFT, the one from Numerical Recipes in C.
With a bit of work, I swapped in the kiss FFT code. This is by no means top-notch in the speed derby, but it should be better. It has the advantage of being small and simple. Here's the same process, on the same machine using that code:
% cumulative self self total time seconds seconds calls us/call us/call name 49.43 1.30 1.30 28116 46.24 46.24 kf_work 15.97 1.72 0.42 11246 37.35 92.12 aks_to_H 9.13 1.96 0.24 28116 8.54 54.77 four1 6.46 2.13 0.17 5623 30.23 85.01 aks_to_M2 6.08 2.29 0.16 11246 14.23 69.00 synthesise ...Total time turns out to be 2.63 sec. We have to be careful to charge both kf_work and four1 against FFT, because the 'four1' routine is my bridging code. Still, this is quite the improvement.
On decode, anyway, FFT remains the big time-hog. We can improve it further, by about another 50%, if we treat all transforms as entailing real numbers. kiss FFT has a real number routine on board, but I'm having trouble interfacing this with Dave's existing code. Something's just not aligning properly.
Finally, the gold standard in C code FFT speed appears to be the FFTW library. Implementing this should improve matters greatly again, although its interface seems a bit hairy. After that, it's hardware DSP code. The N800, for example, actually has a DSP engine which could be engaged to this purpose quite well, but I can't find if or where a FFT routine is exposed usefully. In any case, the N800 is a pretty old machine: what barely works on it should do just fine on more recent hardware.
Comments
Post a Comment