Hey there, welcome to the second blog of the series, and the first one to document the coding period. The community bonding period, which I described in my previous blog, ended on 31st May and paved the way for the official coding period of Google Summer of Code. These past two weeks were the first in which I spent most of my time working on the actual code that will be a part of my project. My primary objective over these two weeks was to study the proof-of-work code that implements the spectral matrix algorithm to compute spectra, and to execute it on a GPU. This was followed by a period of studying the different mechanisms with which RADIS calculates spectra, and understanding the differences between them. This was important because implementing GPU-compatible methods for all these distinct pipelines is my final objective, and it is essential for me to understand the differences between these methods at the very onset of the project. Finally, the remaining time was spent on back-and-forth discussions with my mentors about the various languages and libraries that could have been possible choices for undertaking this project. Once we had made our decision, I spent the time going through the library's documentation, source code and tutorials to familiarize myself with these tools.
The first major objective for these two weeks focused on studying and executing the proof-of-work code. This was a single CUDA C file demonstrating that using a spectral matrix to compute a spectrum on a GPU could offer performance gains of multiple orders of magnitude over the naive methods. I initially planned on executing the code on my personal computer, but that idea was quickly dismissed for the reasons I already discussed in my previous blog. As a result, I ended up using Google Colab for this experimentation, which came with its own fair share of discomforts. The first, and most significant, was the lack of persistent storage on Colab, which forced us to resort to Google Drive for saving our database instead. This was costly both in the time it took to store the data on the cloud and in the overall performance of the code, since the time taken to load the data into memory increased significantly compared to a single CPU-GPU system like my personal laptop. This, however, was not detrimental to the fundamental objective, because each part of the code could be benchmarked separately, so it did not affect the execution of the device code or its performance in any way.

Another task that popped up when using Colab was setting up the system so it could compile and run native CUDA C files alongside the Python code. Fortunately this was not very difficult to solve, and a couple of Google searches gave us the list of packages we needed to compile and execute C files on Colab. Once that was set up, the only thing left was to transfer the data from my laptop to Google Drive. This once again posed a problem I had not anticipated: uploading 8 GB of data takes much longer than downloading the same amount! As soon as that realization hit me, I adopted another approach. I took the code I had used to download the data from the FTP server to my local storage and ran it on Google Colab instead, which let me re-download the entire dataset (~30 GB in its raw format) directly onto my Drive. The process was much faster than I had anticipated, and I soon had the raw data on my Drive. After running another couple of scripts to format and repartition the data into separate numpy arrays, I was ready to go. (A few rough sketches of these Colab steps are included below.)

Execution of the code went smoothly except for a few hiccups around the matplotlibcpp library that was being used to plot the output spectrum. I wasn't able to solve this problem immediately like the others, so I raised it with my mentors. They advised me not to worry too much about it for now, as it really isn't the critical part of the project. The major part, the kernel that runs on the GPU, worked as expected, and the results we obtained by timing the kernel were very positive! Now that I had successfully executed the code, what followed was a series of runs of the same code, this time aimed at testing how far we could take this GPU-compatible approach. To give some numbers: the original proof-of-work code, crunching the 8 GB processed database, computed a total of 240 million lines in less than a second! To be more specific, it took 120 ms on average to achieve that. To put that into perspective, a naive implementation of the same code, without the optimizations used here, would take roughly 10,000x longer to produce the same results!
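For anyone curious, compiling and running the CUDA C file from inside a Colab notebook ended up being roughly as simple as the sketch below. The file name is just a placeholder, and it assumes nvcc and whatever other packages are needed have already been installed on the GPU runtime:

```python
import subprocess

# Compile the proof-of-work CUDA file (the name here is a placeholder) with nvcc
subprocess.run(
    ["nvcc", "spectral_matrix.cu", "-o", "spectral_matrix", "-O3"],
    check=True,
)

# Run the resulting binary and print whatever it reports (timings, etc.)
result = subprocess.run(
    ["./spectral_matrix"], capture_output=True, text=True, check=True
)
print(result.stdout)
```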
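The re-download trick looked roughly like this: a minimal sketch, where the FTP host, directory and Drive folder are placeholders rather than the actual database locations:

```python
import os
from ftplib import FTP

from google.colab import drive
drive.mount("/content/drive")                       # make Drive visible to the Colab runtime

DEST = "/content/drive/MyDrive/radis_data/raw"      # hypothetical Drive folder
os.makedirs(DEST, exist_ok=True)

with FTP("ftp.example-database-host.org") as ftp:   # placeholder host, not the real server
    ftp.login()                                     # anonymous login
    ftp.cwd("/pub/line_database")                   # placeholder directory
    for name in ftp.nlst():
        # Stream each raw file straight onto Drive, no local upload needed
        with open(os.path.join(DEST, name), "wb") as f:
            ftp.retrbinary(f"RETR {name}", f.write)
```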
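And the repartitioning step was conceptually along these lines, assuming the raw records have already been parsed into a pandas DataFrame (the paths and column names here are hypothetical stand-ins for the real line parameters):

```python
import os
import numpy as np
import pandas as pd

# Hypothetical location of the already-parsed line records
df = pd.read_parquet("/content/drive/MyDrive/radis_data/parsed_lines.parquet")

OUT = "/content/drive/MyDrive/radis_data/npy"
os.makedirs(OUT, exist_ok=True)

# One flat float32 array per line parameter, each saved as its own .npy file
for column in ["wavenumber", "intensity", "gamma_air", "gamma_self", "elower"]:
    np.save(os.path.join(OUT, f"{column}.npy"),
            df[column].to_numpy(dtype=np.float32))
```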
A gap like that in itself makes the naive approach an impractical solution to the problem. Compared to the current RADIS implementation, the performance gain was still significant, with up to a 50x reduction in the time spent computing the spectrum. To see how far we could take this code, we also tried it with a number of different spectral ranges from the same dataset. While the original code was tested on wavenumbers spanning 1750 to 2400 cm-1, we took it as far as 1250-3050 cm-1. Surprisingly, the code scaled pretty well with the increase in the number of lines being computed, going from the original 120 ms for 240M lines to ~220 ms for 330M lines. Testing such a wide range and getting such positive results was sufficient proof for us to wrap up the analysis and move on to the actual implementation.
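As a quick sanity check on those scaling numbers, here is the throughput each run works out to, a tiny back-of-the-envelope calculation using only the figures quoted above (both runs stay in the billions of lines per second):

```python
# Back-of-the-envelope throughput for the two runs quoted above
runs = {
    "1750-2400 cm-1": (240e6, 0.120),   # lines computed, average kernel time (s)
    "1250-3050 cm-1": (330e6, 0.220),
}
for label, (lines, seconds) in runs.items():
    print(f"{label}: {lines / seconds / 1e9:.1f} billion lines per second")
```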