For my CIS 565 ”GPU Programming and Architecture” (University of Pennsylvania, Spring 2012) final project, I had decided to try working an a project that would not only demonstrate a practical application of parallel computing, but also become a useful side-research for my main pet project, Beats. The project I had in mind was to apply GPU acceleration and parallel processing techniques to improve the time spent on the BPM calculation algorithm used in the simfile-pattern-generating MATLAB program, Dancing Monkeys, compatible with DDR simulators such as StepMania and, of course, Beats. Improving the BPM calculation step was of great benefit as a one-minute song, for example, would often take almost two minutes to process. Reducing that processing speed would result in lower user wait time and, if drastic enough, potentially lead to real-time pattern generation. Together with my partner Yanjie Feng, we called the final project, Dancing Monkeys Accelerated.
The end result was somewhat surprising. Turns out, due to the nature of the BPM calculation algorithm used (O’Keefe implements a brute-force algorithm similar to the one described by Will Archer Arentz’s paper, “Beat Extraction from Digital Music”), parallel processing techniques were ineffective as the ratio of access operations to arithmetic operations was too high (leading to additional data look-up overhead). Approaches attempted included MATLAB’s Parallel Computing Toolbox’s parfor, and the third party Jacket plugin.
On the other hand, I was able to achieve an astounding 10x speedup purely through hand-optimizing the original algorithm (who would have thought that unrolling three simple branches could lead to such drastic improvements!). In the final test, the processing time of a 90s song went from 175s to under 20s!