Hey! Continuing from where we left off last time, in this blog I’ll be talking about the work I did between 1st and 15th July. As I mentioned in the previous blog, my primary task for this period was to get the Cython/CuPy version of our proof-of-work code to produce the correct output. I had initially expected the work to be relatively easy, as the output was already there; I simply had to check the logic of the program to identify why it wasn’t correct. However, the work turned out to be a lot more painful than I’d have imagined. I had briefly touched upon this in the last blog as well, but the first thing I had to do in order to finish this task was to ensure that the debugging tools and methods were in place. This, fortunately, was one of the easier parts of the problem. The problem was twofold, and thus required two different solutions as well.
First, loading the dataset into RAM was a challenging task for my computer. I am not sure exactly why that was the case, but it was nowhere near as performant as the C++ code, which loaded the exact same data with relative ease. I looked around the internet to understand what could possibly be slowing down the loading step so much, which was quite simply a single line:
v0 = np.load(dir+path+'v0.npy'). The file itself was around 400 MB, and there were a total of 8 of them, pushing the total memory required to around 3 GB. The solution to this problem was relatively straightforward, and felt almost as if it had been hiding in plain sight when I did find it. The core idea is that when we load a numpy array the way I was doing, without having declared the variable previously as a C type, Cython quite naturally assumes it to be a pure Python variable and therefore fails to deliver the performance boost it promises on compilation. This, however, was a trivial issue to resolve. All I had to do to leverage the advantage promised by Cython was to declare the numpy arrays prior to loading them with the datasets. This was done with a simple line
cdef np.ndarray[dtype=np.float32_t, ndim=1] v0 = np.zeros(N_points, dtype=np.float32). That was it! With just this single addition, the array was now a C-type variable and thus was processed significantly faster than the older pure numpy arrays. Unfortunately I didn’t benchmark the difference, as there are still some things about this part I am not super confident about. I am not sure if it’s due to observation bias or some other external factor, but I felt that the speed of loading the data itself varied quite significantly even with the same binaries. It may be some caching or optimization being done under the hood by the compiler, but whatever it is, it would certainly make aimless benchmarking, without controlling for these external factors, a futile exercise.
The second issue that was making debugging difficult was the inability of my GPU to automatically kill the Python/CuPy tasks once the program finished execution. I searched around Stack Overflow and found that it is actually a rather common issue with CuPy. As a result, it also didn’t take long before I found a makeshift solution for this problem as well. All I had to do once the program had finished execution was to call some specific CuPy methods to free the memory, and it worked just fine! With these two issues sorted, I had a much better setup in place to debug the code without being forced to restart the computer or wait 10 minutes for the RAM to clear up every time the program finished execution!
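For reference, the cleanup calls in question were CuPy’s memory-pool methods (a minimal sketch; depending on how the arrays were allocated, you may only need the first pool):

```python
import cupy as cp

def release_gpu_memory():
    # Return all cached device memory held by CuPy's default pool
    # to the system instead of keeping it reserved for reuse.
    cp.get_default_memory_pool().free_all_blocks()
    # Do the same for pinned (page-locked) host memory, if any was used.
    cp.get_default_pinned_memory_pool().free_all_blocks()
```

Calling this once the computation is done lets the GPU memory actually drop, instead of waiting for the Python process to exit.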
Now moving on to the actual issues in the code! It’d be an understatement to say the code was merely riddled with bugs. As some of you might remember, I explained in the previous blog how we had decided not to worry about fitting the variables into structs before passing them, and instead adopted a slightly different approach of using each attribute as an independent variable. While this wasn’t such a bad idea, and in fact allowed me to proceed quite quickly, it was annoying in terms of how much had to be typed just to pass a single variable. Every time you needed to use a variable, you had to ensure it had been specified with the global identifier first. Naturally, all this left a huge scope for issues, and I certainly faced some. After writing the kernels and compiling the code, I tried to run the program for the first time. The initial few errors I faced were trivial bugs that were solved almost instantly. But the first major problem I ran into was an IllegalMemoryAccess error thrown by CuPy/CUDA. As the name suggests, this was happening because somewhere in my kernel I was trying to access an address in memory that just hadn’t been allocated. The problem in debugging this issue was that I had no idea which memory load operation was throwing the error; I knew neither the variable nor the line number. This made it very difficult to debug, along with the fact that I simply couldn’t get the kernels to print anything to the terminal using CuPy. I was probably a few days into this problem before I approached my mentors to discuss it. While my mentors were very helpful, went through the code, and provided feedback on a lot of other aspects that had room for improvement, none of their suggestions were aimed directly at the problem itself. We had a few places we knew we could inspect to check whether they were the source, but it would have been a difficult process.
Fortunately, though, something really unexpected happened at this time. Dirk found a way to get the structs to work in CuPy, as well as the syntax to save them in the GPU’s constant memory. While this didn’t directly affect the kernel code itself, it did make the code a lot neater and easier to inspect. However, the biggest advantage wasn’t that! For some weird reason which I am still not sure of, modifying the code to store variables in structs and saving the two constant structs in constant memory somehow solved the illegal memory access! I was so stunned the first time it happened. I had been banging my head against the wall for over a week because of that issue, and somehow making another adjustment in the code, which I thought was completely unrelated to the issue, resolved it instantly. The feeling was so satisfying it’s hard to put into words!
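In case it helps anyone, here is roughly what that trick looks like with CuPy’s RawModule (a sketch only: the struct fields, kernel, and names like initData are illustrative, not the actual RADIS ones):

```python
import ctypes
import numpy as np
import cupy as cp

# Declare a struct in the CUDA source and mark one instance __constant__.
kernel_source = r'''
struct initData {
    float v_min;
    float dv;
    int   N_v;
};
__constant__ initData init_d;

extern "C" __global__ void fill_axis(float* out) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < init_d.N_v)
        out[i] = init_d.v_min + i * init_d.dv;
}
'''

module = cp.RawModule(code=kernel_source)

# Mirror the struct layout on the host with a structured numpy dtype
# (fields must match the CUDA struct's types and order):
init_dtype = np.dtype([('v_min', 'f4'), ('dv', 'f4'), ('N_v', 'i4')])
init_h = np.array([(2000.0, 0.01, 1024)], dtype=init_dtype)

# get_global returns a pointer to the __constant__ symbol on the device;
# copy the host struct into it byte-for-byte.
const_ptr = module.get_global('init_d')
const_ptr.copy_from_host(init_h.ctypes.data_as(ctypes.c_void_p), init_h.nbytes)

out_d = cp.zeros(1024, dtype=cp.float32)
module.get_function('fill_axis')((4,), (256,), (out_d,))
```

The nice part is that a single copy_from_host replaces the long list of individual global variables, and the kernel reads the fields straight from constant memory.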
However, the battle was still not over. Although the code went on to execute till the end after this error was resolved, it still failed to produce the correct output. I was getting an output, an array which ideally should have been the spectrum produced according to the conditions specified and the data passed, but it just didn’t match the output of the C++ program. More than just a mismatch in the values, it felt particularly bad since the values I was getting were a mixed bunch of both positive and negative numbers, which especially didn’t make any sense: the output was supposed to represent intensity, so a negative value had no real physical meaning. Once again I was stuck in another long, arduous cycle of trying to identify what was going wrong in those 1000 lines of code. However, this time I was determined to solve it myself, and thus started the debugging process in the most rudimentary manner possible. I decided to follow the execution of the program one step at a time and print the values obtained at each step. Sounds pretty simple, doesn’t it? Except the problem was this: the values being processed and passed on to the next methods in this sequence of steps weren’t separate primitive values like floats or ints, but arrays with more than 50,000 floats, or at times even millions of them. Thus, while the idea was pretty simple, the execution was just as exhausting. I had to print the values not just for the Cython code I had written, but also for the C++ program at each step, since I didn’t have anything else to refer to. This was just as cumbersome as it sounds. After printing the thousands upon thousands of values in text files and saving them, I was also left with the task of checking whether they matched. Again, since there were too many rows to compare directly, I had to write scripts in Python to do just that: load arrays with hundreds of thousands of elements from text files, and compare them.
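A sketch of the kind of comparison script I mean (file names and tolerances here are illustrative):

```python
import numpy as np

def first_mismatch(file_a, file_b, rtol=1e-5, atol=1e-8):
    """Return the flat index of the first element where the two dumps
    disagree, or -1 if every element matches within tolerance."""
    a = np.loadtxt(file_a, dtype=np.float32)
    b = np.loadtxt(file_b, dtype=np.float32)
    if a.shape != b.shape:
        return 0  # shapes differ: mismatch right away
    bad = ~np.isclose(a, b, rtol=rtol, atol=atol)
    return int(np.argmax(bad)) if bad.any() else -1
```

Running this on the Cython dump versus the C++ dump for each stage tells you immediately whether that stage is clean or where it first diverges.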
While the process wasn’t difficult in itself, it was what people might call monotonous and extremely draining, since each step took time and, almost ironically, couldn’t be carried out in parallel, as I just didn’t know at what stage the error crept in. Thus, after hours and hours of executing the program, saving the files and comparing the values, I was able to narrow down my search to a single method in the huge file. In fact, I knew exactly where the error was. It was the output array from the first kernel, which we called
host_params_h.DLM_d_in. Now that I knew where the error was, I was sure it originated somewhere in the first kernel, as it failed to produce the correct output. Thus, I once again started the whole process of comparing the variable values one at a time until I could see where the difference lay. However, this time the search failed to bring any results, as literally every single input variable for the kernel was identical to the values being passed to the C++ kernel. There was no chance of the kernels behaving differently in the two programs, as the code was literally the same for both. And with that I hit a dead end. Again lost with no ideas left to pursue, I approached my mentors, and they offered a lot of suggestions as to what might be going wrong; once again it was something they suggested that resolved the issue. It turns out that the array we were using in the program, a three-dimensional numpy array, was not being read properly by the kernel. As some of you might know, even though we often use multidimensional arrays in our programs, the memory in the computer is arranged in a single dimension. Therefore, multidimensional arrays bring with them the question of how to store them in memory. This exact idea was at the root of the pain point in my code. While the numpy arrays we defined in the Cython portion of our code were, for all practical purposes, normal 3D arrays which could be indexed using the standard arr[i][j][k] notation, under the hood things were being done differently than we thought. However, once we realized that this was the source of the error, it was quite easy to resolve by simply specifying, during the declaration of the array, the type of ordering we want for our elements in memory. In numpy, there are two options for order, ‘C’ and ‘F’.
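As a quick illustration of what is at stake (a toy example, nothing RADIS-specific), the same values end up in memory in a different sequence depending on the order:

```python
import numpy as np

# The same 2x3 array with 'C' (row-major) and 'F' (column-major) layout:
a_c = np.arange(6, dtype=np.float32).reshape(2, 3)  # numpy defaults to 'C'
a_f = np.asfortranarray(a_c)                        # same values, 'F' layout

assert (a_c == a_f).all()            # element-wise, nothing changed

# ravel(order='K') reads the elements in the order they sit in memory:
print(a_c.ravel(order='K'))          # [0. 1. 2. 3. 4. 5.]  rows contiguous
print(a_f.ravel(order='K'))          # [0. 3. 1. 4. 2. 5.]  columns contiguous
```

A kernel that assumes one layout while reading memory written in the other will index perfectly valid data in a completely scrambled order.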
C refers to the style in which C stores multidimensional arrays in memory, in a ‘row-major’ fashion, while F refers to Fortran, which instead adopts the ‘column-major’ option. With that resolved, I was expecting, as you might be too, that everything would now work exactly the way we wanted. But this problem was still far from over! I had successfully managed to get the values of
host_params_h.DLM_d_in to match, and yet the final output didn’t match the expected value. At this point I was extremely frustrated, since the mismatch in the values meant that the error was being generated somewhere in the 10-odd lines of code left after the first kernel’s execution. Out of those few lines, a couple were calls to the Fourier transform functions built into CuPy, and I was certain those couldn’t possibly be the source of the error. The only possibility left was the second kernel call. However, unlike the first kernel, the second kernel was fairly straightforward, and it wasn’t clear what could possibly be going wrong in those 15 lines or so. This went on for a few more days, until one day my mentor, Dirk, suggested that since we were so close to the final working code, we do a live collaborative coding session to fix the final problems and be done with it. I couldn’t have asked for more! For me, who had been struggling with these last few annoying bugs for so long, Dirk offering to help me out not just over text but on a video conference was a silver lining. We quickly decided on a date and were ready to squash those bugs. While we made some progress in the first hour or so, eventually we too hit a dead end where neither of us could figure out what could be going wrong. And then, with a stroke of luck that some might consider divine intervention ( :P ), the strange thought of trying to compute the FFT on the CPU came to mind. It was something we tried out of pure desperation to do something, to check something. At that point we’d done everything we could think of to find where the bug came from, but to no avail. I had absolutely zero expectations from the change, for all we did was swap the processor on which the FFT was computed: instead of computing it on the GPU using CuPy, we did it directly on the CPU using NumPy.
And once again, as the program executed and produced the correct shape on the output plot, we were left speechless. We had absolutely no idea what had happened. The change we’d made a few moments ago was the last thing I would have suspected as responsible for the bug, but somehow, that was it. But apart from the surprise, we were also happy and relieved to have finally identified the source of the error: it was either the forward FFT or the inverse FFT. Another trial-and-error experiment narrowed it down further, and at that point we were sure it was the forward FFT that was the source of the error. However, we still couldn’t comprehend why that might be the case. We did a lot of googling, a lot of playing around with
np.fft.rfft, but it didn’t help. Even though we knew the problem was somewhere in the way the data was being read (as we’d previously noted during the other indexing issue), we didn’t know why or how. Efforts to print the order of the array didn’t work, as the numpy array class apparently does not have an attribute called
order; instead, it is something that can only be passed during declaration. After sitting there for maybe another hour or so, Dirk came up with something absolutely amazing. In order to check the way the data was mapped to memory without having something like
order to print, Dirk found and suggested the use of
flags. flags was in fact the attribute we were looking for, containing all the information about the way the data was mapped to memory. A quick look at the flags of the FFT output confirmed our suspicions: the returned array was F-contiguous, but not C-contiguous. This meant the data was being stored in column-major format, while the kernel had been written for C-ordered (row-major) data. It was an easy issue to resolve once the error itself was clear, and within a minute, after changing the indexing in the kernel body, we were done! The code compiled without any hiccups, the binaries ran without any issues, and out came, within a minute, the plot I had been waiting for! It was a great moment, to be very honest, seeing the work I had been doing for the past month or so finally working the way we’d always expected it to. With that, I finally had a working Python-compatible version of our code to produce spectra using GPUs.
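The same diagnosis can be reproduced with nothing but NumPy (shapes here are toy values): an FFT taken along a non-final axis can come back without C-contiguity, and flags is what reveals it.

```python
import numpy as np

a = np.random.rand(8, 4)

# rfft along axis 0: numpy computes along the last axis internally and
# swaps axes back, so the result may not be C-contiguous.
out = np.fft.rfft(a, axis=0)   # shape (5, 4)
print(out.flags)               # inspect C_CONTIGUOUS / F_CONTIGUOUS here

# A kernel written for row-major data needs a C-contiguous buffer;
# a single call fixes the layout without changing any values:
out_c = np.ascontiguousarray(out)
assert out_c.flags['C_CONTIGUOUS']
```

In our case the fix went the other way, changing the kernel’s indexing to match the layout, but forcing contiguity on the host side would have worked too.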
Now that we’re done with this part of the project, the upcoming weeks will be spent on making this work with RADIS. While the code, as it is right now, is capable of calculating spectra by itself without the support of any external code except the modules we use, it might not remain in such an independent form after its integration with RADIS is complete. I am not very certain about the details of this part of the project yet, but I will be having a meeting with my mentor, Erwan, soon to discuss it. In the next blog, I think you can expect some discussion of the integration part of this project, along with details of how RADIS will start supporting GPU-compatible methods! With that, I would like to conclude this blog! Disqus is always open for feedback and, of course, thanks for your time!