
Dancing Monkeys: Accelerated

For my CIS 565 “GPU Programming and Architecture” (University of Pennsylvania, Spring 2012) final project, I decided to try working on a project that would not only demonstrate a practical application of parallel computing, but also serve as useful side research for my main pet project, Beats. The project I had in mind was to apply GPU acceleration and parallel processing techniques to reduce the time spent on the BPM calculation algorithm used in Dancing Monkeys, a simfile-generating MATLAB program compatible with DDR simulators such as StepMania and, of course, Beats. Improving the BPM calculation step would be of great benefit, as a one-minute song would often take almost two minutes to process. Reducing that processing time would mean less waiting for the user and, if drastic enough, could potentially lead to real-time pattern generation. Together with my partner Yanjie Feng, we called the final project Dancing Monkeys Accelerated.

See the full paper at http://keripo.com/static/academia/upenn/cis565/project/2012-04-24%20Dancing%20Monkeys%20Accelerated.pdf

The end result was somewhat surprising. It turns out that, due to the nature of the BPM calculation algorithm used (O’Keefe implements a brute-force algorithm similar to the one described in Will Archer Arentz’s paper, “Beat Extraction from Digital Music”), parallel processing techniques were ineffective: the ratio of memory-access operations to arithmetic operations was too high, leading to additional data look-up overhead. The approaches attempted included MATLAB’s Parallel Computing Toolbox (parfor and GPUArray) and the third-party Jacket plugin.
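
To give a sense of the workload, here is a minimal sketch (in MATLAB, but not O’Keefe’s actual code; all names are illustrative) of what a brute-force BPM search looks like. Note how nearly all of the work is reading elements out of the energy signal, with only a little arithmetic per read:

% Illustrative brute-force BPM search over an onset/energy envelope.
% 'fs' and 'energy' are stand-ins, not from the Dancing Monkeys source.
fs = 100;                             % envelope sample rate (Hz)
energy = abs(randn(1, 60 * fs));      % stand-in for a 60-second envelope
bestBpm = 0; bestScore = -Inf;
for bpm = 60:0.1:200                  % exhaustively test candidate BPMs
    period = round(60 / bpm * fs);    % samples per beat at this BPM
    score = 0;
    for offset = 1:period             % test every phase offset too
        idx = offset:period:numel(energy);
        score = max(score, sum(energy(idx)));   % mostly memory reads
    end
    if score > bestScore
        bestScore = score; bestBpm = bpm;
    end
end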

On the other hand, I was able to achieve an astounding 10x speedup purely through hand-optimizing the original algorithm (who would have thought that unrolling three simple branches could lead to such drastic improvements!). In the final test, the processing time of a 90-second song went from 175s to under 20s!
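
I won’t reproduce the actual diff, but the flavor of the optimization was roughly the following (a sketch with made-up names, not the real Dancing Monkeys loop): a branch whose outcome never changes inside a hot loop gets re-evaluated by MATLAB’s interpreter on every iteration, so hoisting (“unrolling”) it out of the loop removes all of that per-iteration cost.

n = 1e6; x = rand(1, n); acc = 0; mode = 1;

% Before: a loop-invariant condition re-tested on every iteration
for i = 1:n
    if mode == 1
        acc = acc + x(i);
    else
        acc = acc + 2 * x(i);
    end
end

% After: the branch is hoisted out, leaving two branch-free loops
if mode == 1
    for i = 1:n
        acc = acc + x(i);
    end
else
    for i = 1:n
        acc = acc + 2 * x(i);
    end
end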

The following are snapshots of the final demo presentation and transcript. For more details, see the links below:
Blog
Final paper
Demo video (ppt)
Code (snapshot)

1. Hello, today I will be presenting to you our CIS 565 final project, GPU-Accelerated Beat Detection for Dancing Monkeys, by Philip Peng and Yanjie Feng.
2. Dancing Monkeys is an open source MATLAB program for creating step patterns for the music game, Dance Dance Revolution. This is accomplished through employing a highly precise beat detection algorithm. Unfortunately, the algorithm is extremely slow, so we thought we would try to use the GPU to parallelize the brute force BPM comparisons.
3. Our first test was to try using MATLAB’s Parallel Computing Toolbox to run the code on multiple threads on the CPU. This required some code modification and re-arranging such that the data could be manipulated in a manner that fit with the parfor indexing requirement.
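
As a rough illustration of that restructuring (a sketch only; scoreBpm is a stand-in, not a function from the actual codebase): parfor requires every iteration to be independent and to write only to its own slice of the output.

% Minimal parfor sketch (requires the Parallel Computing Toolbox)
bpms = 60:0.1:200;                 % candidate BPMs, one per iteration
scores = zeros(size(bpms));        % "sliced" output array
scoreBpm = @(b) -abs(b - 128);     % stand-in scoring function
parfor k = 1:numel(bpms)
    % independent iterations, indexed only by the loop variable k,
    % as parfor's indexing rules require
    scores(k) = scoreBpm(bpms(k));
end
[~, best] = max(scores);
bestBpm = bpms(best);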
4. The result was a pretty significant speed boost, with speed improvements as the parallel worker count increased up until 4 workers (since the test computer had 4 physical CPU cores).
5. Our next test was to try using MATLAB’s GPUArray extension to the Parallel Computing Toolbox. This had more restrictions and far less flexibility than parfor.
6. Unfortunately, it turned out that GPUArray was not suitable for this project as the algorithm operates on large amounts of shared data and the Parallel Computing Toolbox does not support global variables on the GPU.
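
For reference, GPUArray usage looks roughly like this minimal sketch; everything the GPU code touches must be explicitly created as or copied into a gpuArray, which is exactly where the algorithm’s reliance on large shared (global) state broke down:

% Minimal GPUArray sketch (Parallel Computing Toolbox, CUDA GPU required)
x = gpuArray(rand(1, 1e6));   % explicit host-to-device copy
y = 2 * x + 1;                % elementwise math runs on the GPU
total = gather(sum(y));       % explicit device-to-host copy back
% A 'global' variable declared on the CPU side is not visible to the
% GPU computation; all state must be passed in explicitly, which the
% original algorithm's structure did not allow without a major rewrite.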
7. Our final test was to try using the third-party Jacket plugin for MATLAB. Not only does Jacket support direct data copying to and from the GPU, but it also supports a far greater range of functions. Code modification was similar to that required of parfor with a few more restrictions.
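
From memory, Jacket-ified code looked something like the sketch below. Jacket has since been discontinued (its successor is ArrayFire), so treat the exact function names as approximate rather than authoritative:

% Rough Jacket sketch, as I recall the API circa 2012
x = gdouble(rand(1, 1e6));    % explicit copy to the GPU via Jacket's type
y = sin(x) .* x + 1;          % overloaded math executes on the GPU
total = double(sum(y));       % cast back to an ordinary CPU double
% Jacket also provided a gfor/gend loop construct, analogous to parfor,
% which is what most of our modified loops used.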
8. Much to our surprise, Jacket actually reduced overall performance compared to the original baseline.
9. So came the inevitable question – why was it slower on the GPU?
10. To answer this question, we had to first take a closer look at the types of operations that constitute the algorithm: array initialization, element access and assignment, element arithmetic operations, and array operations. Of these four, element access calls were the most prominent.
11. Direct timing comparison tests were run with the following results. Element operations for Jacket showed generally good performance improvements over their CPU counterparts; however, the break-even point for element accesses was very high.
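
The comparisons boiled down to micro-benchmarks along these lines (a sketch using GPUArray for reproducibility; our actual harness also timed the Jacket equivalents):

% Sketch of a CPU-vs-GPU element-access timing comparison
n = 1e6;
a = rand(1, n);
g = gpuArray(a);
tic; for k = 1:1000, v = a(randi(n)); end; cpuAccess = toc;
tic; for k = 1:1000, v = g(randi(n)); end; gpuAccess = toc;
fprintf('CPU access: %.4fs, GPU access: %.4fs\n', cpuAccess, gpuAccess);
% Each scalar access on the GPU pays a fixed per-call overhead (kernel
% launch / transfer), so scattered element reads only break even, if
% ever, at very large batch sizes.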
12. The array operations for Jacket showed significant speed improvements for init, mod, and sort; however, the frequency of these operations in the algorithm was very low.
13. Based on the timing comparison results, there are two potential reasons why the Jacket implementation was extremely slow. The first is that the size of the data being operated on was too small, far less than the break-even points for most of the operations. The second reason is that the algorithm itself uses a LOT of scattered array accesses, leading to an extremely high overhead cost.
14. Upon concluding that the current algorithm itself would not benefit from GPU parallelization for the reasons above, we tried improving the original code directly, rewriting parts of the algorithm to reduce branching and conditional statements, as shown here.
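
The slide showed the actual before-and-after code; as a stand-in, here is the general pattern of replacing a per-element conditional with an equivalent branch-free vectorized expression:

x = rand(1, 1e6);
y = zeros(size(x));

% Before: a per-element branch inside a loop
for i = 1:numel(x)
    if x(i) > 0.5
        y(i) = x(i) - 0.5;
    else
        y(i) = 0;
    end
end

% After: the same computation as one branch-free vectorized expression
y = max(x - 0.5, 0);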
15. Surprisingly, this led to very drastic speedups, reducing the overall time to a tenth of the original. One interesting note is that applying either parfor or gfor to the tweaked code only resulted in longer times, meaning that parallel CPU and GPU overheads outweighed any possible parallelization benefits in the modified algorithm.
16. So overall, we were able to draw a few unexpected conclusions from this project. First, the original algorithm’s small data size and high proportion of access calls meant it was not suitable for GPU parallelization as originally thought. Second, Jacket can offer significant speedups, but those speedups were not realized in this project for the reasons above. Third, the original code of the algorithm was poorly optimized, and the rewritten code was optimized to the point where GPU parallelization could offer no further benefit.
17. That concludes our demo. If you have any questions or want to know more, check out our blog linked below, hosted on Blogspot. The source code for the project is also linked below, hosted on GitHub.

~Keripo (Philip Peng)
