Google Summer of Code - Blog #2!

Hello! This blog marks the end of the first phase of Google Summer of Code. The journey so far has been challenging but also extremely rewarding. The amount I have learnt as a by-product of the work I've been doing on my project is remarkable, and more importantly, it has been a far more pleasant experience than the traditional route of gaining knowledge by reading books and tutorials. In this blog, I will summarise the work I've been doing for the past 2 weeks, update you on where I currently stand, and give an idea of what lies ahead.

As I discussed in my previous blog, the work for the first two weeks mostly involved using Cython to translate the host code into something that could be compiled to provide better performance. Apart from the Cython part, I also had to start porting the kernels from their pure CUDA C form into something that Python/Cython could understand. While neither task took long or proved particularly challenging on its own, the most difficult part of the whole process, undeniably, was getting the two, Cython and CuPy, to talk to one another.

When I finished writing the previous blog, I was still left with quite a large volume of host code waiting to be converted to its Cython equivalent. Going through the previous blog, I'd like to make one correction regarding the part where I said I found Cython quite confusing. In fact, calling it a correction would not be quite right; instead, I should say that Cython isn't actually as difficult as I might have made it look in the last blog. After a fair amount of reading and writing Cython code, I think the developers of Cython have done an excellent job. Once the initial barrier is crossed (and it happens fairly naturally), writing Cython code feels just as normal and second nature as pure Python code. In fact, I'd go so far as to claim that Cython now feels more natural to me than Python itself, because of my previous experience in C++.

This newly found familiarity allowed me to proceed quite quickly with this part of the project. The only problems I encountered were: (1) the inter-conversion between arrays/vectors and pointers that we are used to in C++ is not possible in Cython, which forced me to make some slight changes to the host code, and (2) using structs in Cython isn't as direct and straightforward as in C/C++. The reason is that Cython tries to establish relationships between C-type data structures and Python objects. While this is quite trivial for objects of type int, float, char, and even arrays to some extent, for structs it is nowhere near as easy. While I am sure there must be some hacky way around this problem too, my mentors and I decided it was not worth spending time on, so we bypassed all structs and directly used each of their attributes as variables with the struct name attached as a prefix, so host_params_h.shared_size became host_params_h_shared_size. While not exactly the same thing, it allowed us to achieve the same objective without a lot of modifications to the code, either in terms of declaration or syntax.

The only downside to this approach was that it made the code quite verbose: instead of passing a single struct with 10 fields inside it, we were forced to pass 10 separate variables for each struct. This problem was further aggravated because we were using global variables instead of passing them around as arguments, and as every Python user knows, this meant adding the line global <varname> at the top of every function body, which, done for every method and every variable, meant a lot of lines that could have been avoided. Apart from these two major issues and a couple of minor problems here and there, the whole process was fairly straightforward. At the end of this step, I was left with a Cython file which compiled just fine, but didn't really accomplish much. The key ingredient that was missing was the very heart of the project: the kernels.
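To give a feel for what this flattening looks like in practice, here is a minimal sketch (with hypothetical field names, not the project's actual variables) of a struct being replaced by prefixed module-level globals:

# Original C++ style:
#   struct hostParams { int shared_size; float rel_tolerance; };
#   hostParams host_params_h;

# Flattened Python/Cython equivalent: one global per struct field,
# with the struct name kept as a prefix
host_params_h_shared_size = 0
host_params_h_rel_tolerance = 0.0

def init_params():
    # every function that modifies the flattened fields must
    # declare them global, which adds up quickly
    global host_params_h_shared_size
    global host_params_h_rel_tolerance
    host_params_h_shared_size = 1024
    host_params_h_rel_tolerance = 1e-6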

Kernel is actually nothing but a fancy word for the part of the code which executes on the GPU. Given the project title, it'd be fairly obvious that in our project, the kernels are where the magic happens. While this is not meant to discount the importance of the host (or CPU) code, the kernels are ultimately the part of the program responsible for the performance boosts that we observe. Kernels, at least those meant to be executed on Nvidia's GPUs, are usually written in a language known as CUDA C. This is a special language built on top of the original C language, with an extra set of features, classes and methods which provide an abstracted interface to control various aspects of how the program executes on the GPU, offering far more control than a conventional serial program would. While using CUDA C is quite straightforward, especially with the large community support and well-written documentation, we unfortunately could not use it directly, as we wanted something written in, and compatible with, Python and the tools built on top of Python. Thus, after a lot of deliberation and discussion, my mentors and I agreed to use something known as CuPy to handle the CUDA C part of our code. CuPy is an incredibly well-written module with neat documentation and decent community support, which made things a lot simpler for us. However, more than anything, the biggest advantage of using CuPy was its RawModule class. The idea behind RawModule is to let users who already have a CUDA C file written for some specific task (us!) simply re-use their code and get away with the whole problem of running kernels in Python very, very easily. Let me demonstrate with an example, which should make things clearer:


loaded_from_source = r'''
extern "C"{
...
__global__ void test_sum(const float* x1, const float* x2, float* y, \
                         unsigned int N)
{
    unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < N)
    {
        y[tid] = x1[tid] + x2[tid];
    }
}
}'''

import cupy as cp

module = cp.RawModule(code=loaded_from_source)
ker_sum = module.get_function('test_sum')
N = 10
x1 = cp.arange(N**2, dtype=cp.float32).reshape(N, N)
x2 = cp.ones((N, N), dtype=cp.float32)
y = cp.zeros((N, N), dtype=cp.float32)
ker_sum((N,), (N,), (x1, x2, y, N**2))   # y = x1 + x2

While the code looks fairly self-explanatory, I'll give a quick run-through anyway. The idea behind CuPy's RawModule, true to its name, is to allow raw CUDA source to work with Python, which in our case is the string named loaded_from_source. Using RawModule, we process the string as a CUDA C source file, extract the relevant kernels from it, test_sum in our case, and we're done! The function is stored as ker_sum in Python, and is ready for use just like any other Python method. To keep the look and feel of the module as close to the original CUDA C source as possible, even the way kernels are called is quite similar to how they're called in C: the first two parameters are the grid and block dimensions, followed by the actual parameters to the kernel. Clearly, this allowed us to fast-track a lot of the kernel porting work and quickly develop a Python version of our original proof-of-concept code.

However, like everything else, even this wasn't going to work as easily as we'd initially expected. Instead, I faced a new challenge: the use of two datatypes from CUDA's fast Fourier transform library (cuFFT), known as cufftComplex and cufftReal. As I explained in the previous paragraph, this problem was quite similar in nature to the whole Python-object-to-C-object transformation and vice versa. If anything, it felt even more explicit here, since we could clearly see the actual C code written out (as RawModule's string). The problem is that structs are very specific to their definition, and it is simply not possible to pass anything to the RawModule and expect it to be processed as a struct. Even though a numpy array of datatype complex64 might store the exact same data as an array of cufftComplex, the two cannot be interconverted the way vectors of integers and other primitive types can be. This again posed a challenge, as it would mean a deviation from the original code. Finally, after reading a lot of Stack Overflow and still not being satisfied with any of the answers, I let my mentors help me out with the code. The final solution did a couple of things: first, we got rid of the cufftComplex and cufftReal structs and instead introduced a C++ datatype serving the same purpose: complex<float>. This amazing datatype did the exact same thing, except we could pass a numpy (or cupy) array of type complex64 and it would automatically be read as a complex<float> array! It quite literally solved the whole problem in an instant, and the only modification we had to make was changing how the real and imaginary parts of the complex numbers were handled. While the struct definition required us to handle both separately, the new format made things even simpler by letting us perform calculations on the real and imaginary parts simultaneously (basically how you'd expect to work with a complex number anyway!). With this, and again a few minor changes here and there to ensure the data transfer between Cython and CuPy wasn't throwing errors, I was done! However, by now I had a Cython file with 1300+ lines and a lot of room for bugs and unexpected errors and behaviour.
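To illustrate the idea, here is a minimal sketch (a toy kernel, not our project's actual FFT code): CuPy ships a header, cupy/complex.cuh, that makes complex<float> available inside raw kernels, and a cupy array of dtype complex64 can then be passed to the kernel directly:

import cupy as cp

complex_source = r'''
#include <cupy/complex.cuh>
extern "C" __global__
void scale(complex<float>* arr, float a, unsigned int N)
{
    unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < N)
    {
        // real and imaginary parts are handled together,
        // unlike the separate fields of cufftComplex
        arr[tid] = a * arr[tid];
    }
}
'''

module = cp.RawModule(code=complex_source)
ker_scale = module.get_function('scale')
N = 16
arr = (cp.arange(N) + 2j * cp.arange(N)).astype(cp.complex64)
ker_scale((1,), (N,), (arr, cp.float32(0.5), N))   # arr *= 0.5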

This brings us to the present. My current objective is to get the file to work properly and produce the right output, so basically debugging. However, unlike smaller programs and files which are usually debugged with a single method in focus, I have to let the whole chain of methods execute even when I know that the bug is in some specific method, simply because it's impossible to recreate the testing environment otherwise. For instance, I have been debugging the code since yesterday, and except for one time when I got a segmentation fault, every time the program crashes I am forced to restart my computer just to start the debugging process again. The reason, you ask? The program works on a small subset of an already trimmed dataset, occupying around 3 GB of space. Whenever I execute the program, it first loads all that data into RAM (which slows the computer to a crawl almost instantly due to my limited 8 GB of RAM), and subsequently into the GPU (where it takes up 3 of the 4 GB of VRAM present). On force-closing the program, while the RAM does free up after some time to a state where I can start using the computer again, the GPU does not!! I am yet to read into the nitty-gritty of this, but from what I understand, we need to explicitly clear the VRAM occupied by CuPy using methods given in the documentation. However, when the program does not work as expected (which is currently 100 percent of my runs), it simply crashes before reaching the lines which free up the memory. The result? A GPU which is loaded up and unable to free its VRAM. Resetting seems not to work for some reason, and the only option I have found so far (admittedly I haven't researched it well enough, but it's only been 24 hours since I reached this stage) is restarting my computer. This, as I am sure you can already feel, is a very frustrating way to debug things, but I am happy to say that I am still making progress, albeit a little more slowly than I would have liked.
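For reference, the cleanup I'm referring to looks something like the snippet below, using CuPy's memory-pool methods from its documentation; the catch, of course, is that these lines only help if the program survives long enough to reach them (run_simulation is a hypothetical stand-in for the program's entry point):

import cupy as cp

try:
    run_simulation()   # hypothetical entry point of the program
finally:
    # release all unused blocks held by CuPy's device memory pool
    cp.get_default_memory_pool().free_all_blocks()
    # ... and by the pinned (host) memory pool
    cp.get_default_pinned_memory_pool().free_all_blocks()

The try/finally at least guarantees the pools are released on a Python-level exception; a hard crash of the process still leaves the VRAM stuck, which is exactly the situation I keep running into.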

That pretty much sums up everything I have been up to for the past 2 weeks. The progress is a little on the slower side, as I expected to have the demo working and producing the correct output by now, but unfortunately the code still needs debugging. Hopefully this will be over soon, and we can then move on to integrating the code with RADIS and making the changes that will allow users to make use of our program. After that, we will be focusing on implementing other methods of calculating spectra, and possibly also methods which support non-equilibrium conditions. Hopefully I will have a lot more to tell you guys 2 weeks from now! Till then, adios! And thanks for making it this far (if you actually did so :P) Cheers.