Welcome back!! We are done with the first phase of the project and are shifting into the second one. I will keep this post short, since most of the details of the refactor were already covered in my previous post.
By the time I was writing my previous post I had a pretty decent idea of how I would be doing each of the refactors. We had already decided that we may not have to implement all of them because Vaex might render a few of those changes redundant.
I started out by writing a proof-of-concept to remove the column to which the partition function was added. Only the case of equilibrium molecules was handled here. The idea was to make efficient use of pandas’ dictionary and remove the column. The proof-of-concept showed that this approach not only reduced memory but also reduced CPU pressure by around 2x. For the lines of HITEMP-CH4 molecules in the waverange 2000-3000, the dataframe previously occupied 1.2 GB; with this method we could compress that to around 100 MB. [1]
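To give a feel for the idea, here is a minimal sketch in plain pandas. It is not the actual proof-of-concept: the column names (`wav`, `iso`, `Q`), the partition function values, and the helper `drop_constant_per_key_column` are all made up for illustration, and it assumes the partition function takes a single value per isotope, which is my simplification of the equilibrium case.

```python
import numpy as np
import pandas as pd


def drop_constant_per_key_column(df, key, column):
    """Replace `column`, assumed constant within each `key` group, by a small dict.

    Hypothetical helper: returns the slimmer dataframe and a {key: value} lookup.
    """
    # Verify the assumption before throwing the column away.
    if (df.groupby(key)[column].nunique() > 1).any():
        raise ValueError(f"{column!r} is not constant within each {key!r} group")
    lookup = df.groupby(key)[column].first().to_dict()
    return df.drop(columns=column), lookup


# Toy line list: values are invented, not real spectroscopic data.
rng = np.random.default_rng(0)
n = 100_000
q_values = {1: 590.5, 2: 1180.2, 3: 4750.9}  # made-up partition functions
df = pd.DataFrame({
    "wav": rng.uniform(2000, 3000, n),
    "iso": rng.integers(1, 4, size=n),  # isotope ids 1, 2, 3
})
df["Q"] = df["iso"].map(q_values)  # one float64 per line, highly redundant

before = df.memory_usage(deep=True).sum()
slim, q_by_iso = drop_constant_per_key_column(df, "iso", "Q")
after = slim.memory_usage(deep=True).sum()

# The values are still available on demand, without storing them per line.
q_on_demand = slim["iso"].map(q_by_iso)

print(f"before: {before} bytes, after: {after} bytes")
```

The saving in this toy case is just one float column; the numbers quoted above come from the real line list, not from this sketch.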
Apart from this, I wrote another notebook that demonstrated that we can radically improve memory usage by crunching the datatypes of the columns of HITRAN/HITEMP molecules. The notebook just contains elementary operations to arrive at the right datatype for each column. We haven’t implemented this in the codebase yet because we still haven’t figured out what to do with the missing lines, a problem I had already mentioned in my first post. [2]
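The kind of datatype crunching I mean can be sketched as follows. This is an illustration under my own assumptions, not the notebook itself: the helper `shrink_dtypes`, the tolerance `float_rtol`, and the toy columns are hypothetical. Integer columns are downcast with `pd.to_numeric`, and float columns are converted to `float32` only when the values survive the round trip within the chosen relative tolerance.

```python
import numpy as np
import pandas as pd


def shrink_dtypes(df, float_rtol=1e-6):
    """Return a copy of `df` with numeric columns stored in smaller dtypes.

    Hypothetical helper; `float_rtol` is an arbitrary tolerance picked for the demo.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if pd.api.types.is_integer_dtype(col):
            # Smallest signed integer type that can hold every value.
            out[name] = pd.to_numeric(col, downcast="integer")
        elif pd.api.types.is_float_dtype(col):
            as32 = col.astype(np.float32)
            # Keep float64 when float32 would lose too much (e.g. underflow to 0).
            if np.allclose(as32, col, rtol=float_rtol, atol=0.0, equal_nan=True):
                out[name] = as32
    return out


# Toy data: invented values, only meant to exercise each branch.
rng = np.random.default_rng(1)
n = 50_000
lines = pd.DataFrame({
    "iso": rng.integers(1, 4, size=n),   # int64, fits in int8
    "wav": rng.uniform(2000, 3000, n),   # float64, fine as float32 at this tolerance
    "tiny": np.full(n, 1e-60),           # too small for float32, stays float64
})

small = shrink_dtypes(lines)
print(small.dtypes)
print(lines.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```

Whether `float32` is precise enough for a given column is a judgment call that depends on how the column is used later, so the tolerance here should be read as a placeholder.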
I was somehow able to sneak my way into successfully completing GSoC phase one, with feedback that has pumped me up to do even better. I am looking forward to the second phase and hope to deliver.