GSoC - 0

I will be documenting my journey in the GSoC program under Radis (OpenAstronomy). This post is the first in that series. It gives a quick overview of what Google Summer of Code is, introduces the organization I will be working with and the project I will be involved in, and covers what I did in the 20-day community bonding period.

What is GSoC?

I remember attending one of Programming Club IIT Kanpur’s lectures in my freshman year of college, where my senior asked the students if they knew what GSoC was. I had no idea. But I glanced over to see if my peers knew something and saw a few of them nodding enthusiastically and a few others muttering among themselves. The senior didn’t explain what GSoC was, but he did ask us to check it out ourselves. I did. I wouldn’t say I understood the entire program back then, since I didn’t even know what open source was.

Fast forward around 9-10 months, I started contributing to open source. I really felt it helped me skill up as a developer, which motivated me to participate in GSoC.

Google Summer of Code or GSoC is a program sponsored by Google that aims to connect university students worldwide with open source organizations to promote the open-source culture. Students work with an open-source organization on a 10-week programming project during their break from school and get an opportunity to contribute to high-quality code, learn new skills, and also get compensated for the work. In turn, the organizations benefit from a few extra pairs of helping hands. Any college student interested in software development should definitely check out this program.

Radis and my project

Radis [1] is a fast line-by-line code for synthesizing and fitting infrared absorption and emission spectra such as encountered in laboratory plasmas or exoplanet atmospheres.

Radis aims to provide a wide array of features and remain user-friendly at the same time. It currently supports spectral calculations on databases like HITRAN and high-temperature databases like HITEMP and CDSD-4000, with a future plan to extend the support to ExoMol. It comes with a one-line install and post-processing tools for analysis of the spectra. Users can also combine ranges to create a mixture of gases or calculate radiative transfer along the line-of-sight.

RADIS currently uses pandas DataFrames to handle all the databases. Quoting Wes McKinney (the creator of pandas): “pandas rule of thumb: have 5 to 10 times as much RAM as the size of the dataset” [2]. By that rule, it is impractical to read, say, a 5 GB database on a machine with 16 GB of RAM.
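To make the rule of thumb concrete, here is a minimal sketch that turns it into numbers and shows how to check how much memory a DataFrame actually occupies. The helper `ram_needed_gb` and the column names are made up for illustration and are not part of RADIS.

```python
import numpy as np
import pandas as pd


def ram_needed_gb(dataset_gb, factor_low=5, factor_high=10):
    """Return the (low, high) RAM estimate in GB from the 5-10x rule of thumb."""
    return dataset_gb * factor_low, dataset_gb * factor_high


# The 5 GB example from the text: well beyond a 16 GB machine.
low, high = ram_needed_gb(5)

# A small stand-in table (hypothetical column names) to show how the
# in-memory size of a DataFrame can be measured.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "wav": np.linspace(2000.0, 2400.0, 100_000),  # float64, 8 bytes per row
    "int": rng.random(100_000),                   # float64, 8 bytes per row
})
in_memory_mb = df.memory_usage(deep=True).sum() / 1e6

print(f"5 GB on disk -> roughly {low} to {high} GB of RAM")
print(f"stand-in table occupies {in_memory_mb:.2f} MB in memory")
```

Note that the rule refers to the working memory pandas needs while loading and processing, so the measured size of the final DataFrame is only a lower bound.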

[Image: pandas meme]

The goal of this project is first to reduce the memory usage of the current calculations. Then, we replace pandas with libraries that are better suited for handling larger-than-memory databases, which would make it possible to compute spectra from databases of up to billions of lines (on the scale of hundreds of GB or terabytes). I will save the core technical details of the project for the upcoming blogs.

Community Bonding Period

The Community Bonding Period is a roughly 20-day period meant to serve as a warm-up or a buffer before the actual coding period begins. It can be used for a wide variety of purposes, such as getting a better understanding of the codebase and figuring out its intricacies. I started out by quickly going over Spectro-102 [3] again, since I had left out a few parts the last time I went through it. I then studied the RADIS [1] paper. Though I cannot really say I understood the entire document, I did get a top-level idea of how it works and how it is different from other software.

My failed attempts in wrapping up the previous work

After my GSoC application, I started working on a feature request that asked for a specific function in the code to return the wavelength and intensity grids sorted in ascending order. I just assumed that all I needed to do was sort the grids, so I did this and created a PR. I later learned that Radis, like any good codebase, has many tests that make sure things don’t break when a new change is made. Apparently, returning the wavelength and intensity grids in sorted order broke the physics when combining spectra.
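For reference, the naive "just sort the grids" idea looks something like the sketch below. The function name `sort_spectrum` is hypothetical and this is not the actual RADIS code; it only illustrates that both arrays have to be reordered with the same permutation so each intensity stays paired with its wavelength.

```python
import numpy as np


def sort_spectrum(wavelength, intensity):
    """Sort both grids by ascending wavelength, keeping the pairs aligned."""
    order = np.argsort(wavelength, kind="stable")  # permutation that sorts wavelength
    return wavelength[order], intensity[order]


# Toy data in descending wavelength order.
wavelength = np.array([4.3, 4.2, 4.1])
intensity = np.array([0.3, 0.2, 0.1])

w_sorted, i_sorted = sort_spectrum(wavelength, intensity)
print(w_sorted, i_sorted)
```

Sorting each array independently would scramble the pairs, and even the aligned version can still surprise other code that relies on the original ordering, which is the kind of breakage the test suite caught.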

Before this PR, I was unaware of pytest. I went through the documentation [4], ran the tests on my machine, and checked out each of the failing tests. This helped me understand different parts of the code, especially the spectrum and los modules of the repository. The tests of the spectrum module passed after a few modifications. But after I updated Erwan on my progress, I realized that I now needed to design new tests, since the existing ones cannot pinpoint where the problems are in the codebase. Besides, after a brief chat with Erwan, I learned about the different types of tests (non-regression, validation, and verification) that exist in RADIS to ensure things don’t break.

We have decided how we will tackle this issue, but since I am required to start on my project from tomorrow, I will be getting back to this PR later and hope to find time for the same during the coding period.

Discovery during HITEMP (CO2/H2O) download automation

In the first phase of my project, I am required to apply a few hacks [5] to the pandas DataFrames to boost their memory performance. This includes dropping a few columns and changing the datatypes of a few others. Coincidentally, Dirk encountered an issue while working on automating the download of CO2/H2O for HITEMP. The CO2/H2O spectral databases are split across multiple zip files, and automatic download of these was not supported in RADIS. Because some columns contain NaN values, which np.uint does not support, the datatypes of a few columns conflicted when the databases were stacked on top of one another. Currently, this is handled by returning those parameters as memory-inefficient np.float64. I will most probably have to bring them down to more suitable datatypes (np.uint). This will probably be the first thing I do as part of the project.
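Here is a small sketch of the dtype conflict described above, with two possible ways of getting back to a compact integer type. The column names and the sentinel value are assumptions for illustration only, and neither option is necessarily what RADIS will end up using.

```python
import numpy as np
import pandas as pd

# Two hypothetical chunks of a line database (illustrative column names).
chunk_a = pd.DataFrame({
    "iso": np.array([1, 2, 1], dtype=np.uint8),
    "branch": np.array([0, 1, 2], dtype=np.uint8),
})
chunk_b = pd.DataFrame({
    "iso": np.array([1, 3], dtype=np.uint8),
    "branch": np.array([np.nan, 1.0]),  # a missing value forces float64 here
})

# Stacking the chunks: uint8 cannot hold NaN, so "branch" becomes float64.
combined = pd.concat([chunk_a, chunk_b], ignore_index=True)
dtype_after_concat = combined["branch"].dtype

# Option 1: pandas' nullable unsigned integer dtype (NaN becomes pd.NA).
nullable = combined["branch"].astype("UInt8")

# Option 2: replace NaN with a sentinel (assumed unused by real data),
# then cast to a plain NumPy unsigned integer.
SENTINEL = 255
plain = combined["branch"].fillna(SENTINEL).astype(np.uint8)

bytes_float64 = combined["branch"].memory_usage(index=False)
bytes_plain = plain.memory_usage(index=False)
print(dtype_after_concat, nullable.dtype, plain.dtype)
print(f"{bytes_float64} bytes as float64 vs {bytes_plain} bytes as uint8")
```

The nullable dtype keeps missing values explicit at the cost of an extra mask, while the sentinel approach is the most compact but only works if the chosen value can never appear in the data.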

The next two weeks

In the next two weeks, I will be figuring out and implementing all the database pre-processing that can be done to boost pandas’ performance [4]. I will also set up memory performance benchmarks to track these changes. I am super excited to see how this project goes. I would like to thank Google, OpenAstronomy, RADIS, and my mentors Erwan Pannier, Dirk van den Bekerom, and Pankaj Mishra. I hope to learn a lot along the way, and hopefully I will deliver. So,
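As a rough idea of what such a memory benchmark could look like, here is a minimal sketch using Python's built-in tracemalloc module. The helpers `peak_memory_mb` and `build_table` are hypothetical stand-ins, not the benchmarks that will actually be used in RADIS.

```python
import tracemalloc

import numpy as np
import pandas as pd


def peak_memory_mb(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 1e6


def build_table(n_rows, dtype):
    """Stand-in for loading a database: one column of the given dtype."""
    return pd.DataFrame({"wav": np.ones(n_rows, dtype=dtype)})


# Compare the same (toy) workload before and after a dtype change.
_, peak64 = peak_memory_mb(build_table, 1_000_000, np.float64)
_, peak32 = peak_memory_mb(build_table, 1_000_000, np.float32)
print(f"float64: {peak64:.1f} MB peak, float32: {peak32:.1f} MB peak")
```

tracemalloc only sees allocations made through Python's and NumPy's tracked allocators, so for a real benchmark it would be worth cross-checking against the process-level memory as well.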

Let the Games Begin