Google Summer of Code: Final Submission

I have completed my Google Summer of Code project, “Fast Parsing of Large Databases and Execution Bottlenecks,” on the Radis project under the OpenAstronomy umbrella. The project developed high-performance, line-by-line parsing for high-resolution infrared molecular spectra, and this is the final blog post documenting the results and lessons learned. Most important of all, I loved the work I was able to do; it was interesting and made possible by my helpful and amazing mentors. I am incredibly grateful to Dr. Nicolas Minesi, Dr. Dirk van den Bekerom and Tran Huu Nhat Huy.
Project Description
The problem is that the HITEMP CO₂ spectroscopic database is extremely large and inefficient to work with: the distributed file is about 6 GB when compressed but expands to roughly 50 GB, and the existing workflow requires fully decompressing and then batch-parsing the entire file, an operation that takes on the order of 2.5 hours and consumes a lot of disk I/O, memory, and network bandwidth. That full-decompress-and-parse approach creates several practical bottlenecks: it forces users to have large, fast storage and long processing windows, makes quick exploratory analysis or iterative development impractical, and wastes bandwidth and time when only small portions of the data are needed. In short, the dataset’s size and the parser’s all-or-nothing design prevent efficient, selective access and slow down every downstream analysis that depends on these spectral lines.
Project Walk Through
After discussing the project, we divided it into three parts. The first part was optimizing the existing code so that, at a minimum, we would have a better working infrastructure. The second part was enabling partial downloads so a user can retrieve only the necessary part of the file without downloading it entirely. The last part was building a C++ Single Instruction, Multiple Data (SIMD) parser using Intel intrinsics. Below I have described each of these in detail.
Optimizing The Existing Parser
While profiling the parser, I found that the main performance bottleneck was the regex extraction step. For example, we were parsing the globu column using a regular expression:
df["globu"]
.astype(str)
.str.extract(
r"[ ]{9}(?P<v1u>[\-\d ]{2})(?P<v2u>[\-\d ]{2})(?P<v3u>[\-\d ]{2})",
expand=True,
)
Although functional, regex-based parsing is computationally expensive, especially when applied repeatedly to large datasets. Since this is HITEMP data, the format is fixed-width and consistently aligned. That means we don’t actually need regex for column extraction; we can replace it with index-based slicing, which is much faster.
# Fixed character ranges of v1u, v2u, v3u inside the "globu" field
_GLOBU_SLICES = {
    "v1u": (9, 11),
    "v2u": (11, 13),
    "v3u": (13, 15),
}

for name, (i0, i1) in _GLOBU_SLICES.items():
    # Slice the fixed-width field, strip padding, and treat blanks as zero
    series = df["globu"].str.slice(i0, i1).str.strip().replace("", "0")
    df[name] = series.astype("int64")
Here I achieved a time reduction of 35–45 percent depending on the molecule, which is impressive considering how small the change was. Here are the benchmarks and pull requests I made for this section:

- Adds vectorised operations method for parse_local_quanta for NO2
- Adds vectorised operations methods of parse_local_quanta for Group 2
- Adds vectorised operations methods of parse_global_quanta for Class 1
I wrote a blog post for this part that explains the problem and how it occurs in detail: Blog Link
Building A Partial Download And Decompression Algorithm
As the compressed dataset is 6 GB, downloading it completely even when the user needs only 45 MB of it (which decompresses to ~500 MB) is a waste of time and resources. I built a partial download and decompression mechanism, which was quite complicated because the file is bz2-compressed and you cannot seek to an arbitrary point in the compressed stream. To solve this, I created one mapping from wavenumber (the quantity users query by) to offsets in the decompressed stream, and a second mapping from those decompressed offsets to the corresponding compressed blocks. This way we can query the remote dataset and download only the 45–65 MB of compressed data that decompresses to a fixed ~500 MB block, which can then be cached for later use. Thanks to Dr. Dirk van den Bekerom for his help dealing with the decompression and offsets.
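To make the lookup-and-fetch logic concrete, here is a minimal sketch; the index values, file name, and URL are made up, and it assumes each block in the remote file is stored as an independently decompressible bz2 stream:

# Sketch only: assumes the remote file is rebuilt as block-aligned, independent
# bz2 streams and that an index mapping wavenumbers to byte ranges already exists.
import bz2
import bisect
import requests

# Hypothetical index: first wavenumber contained in each block, and the
# (start, end) byte range of that block's compressed stream in the remote file.
BLOCK_WAVENUMBERS = [0.0, 500.0, 1000.0, 1500.0]  # cm-1
BLOCK_BYTE_RANGES = [(0, 48_000_000), (48_000_000, 97_000_000),
                     (97_000_000, 145_000_000), (145_000_000, 190_000_000)]

URL = "https://example.org/hitemp_co2_block_aligned.bz2"  # placeholder URL

def fetch_block_for(wavenumber: float) -> bytes:
    """Download and decompress only the block containing `wavenumber`."""
    i = bisect.bisect_right(BLOCK_WAVENUMBERS, wavenumber) - 1
    start, end = BLOCK_BYTE_RANGES[i]
    # HTTP Range request: fetch just the compressed bytes of this block.
    resp = requests.get(URL, headers={"Range": f"bytes={start}-{end - 1}"})
    resp.raise_for_status()
    # Each block is a standalone bz2 stream, so it decompresses on its own.
    return bz2.decompress(resp.content)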
Link to the pull request: Implements Block-Aligned Partial Download Decompression and Caching System for HITRAN CO2

I have described this in detail in these two blogs:
Replacing The Python Parser With A C++ SIMD Parser
As I mentioned above, I was able to get a 500 MB chunk of decompressed data (in .par format) and parse it with the existing parsing mechanism. Even though in the “Optimizing The Existing Parser” part I replaced the pandas regex operations with index-based slicing (improving performance by ~38 percent), this could be improved further. So I built a C++ SIMD parser, which is a lot faster.
The pull request for it is not merged yet because integrating C++ code with Python is a little complicated in Radis: there is a separate repository, pyvkfft_bin, which compiles this code in its workflow, and the resulting executable is then used in Radis. I am still addressing reviews from the maintainers’ side, but hopefully this will be merged soon as well.
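As a rough illustration of that integration pattern (every file name below is hypothetical and not the actual Radis or pyvkfft_bin interface), a prebuilt parser executable could be driven from Python along these lines:

# Illustrative only: file names and output format are assumptions, not the real
# interface. The prebuilt native parser converts a .par text chunk into a binary
# column dump that Python can then load without any text parsing of its own.
import subprocess
import numpy as np

subprocess.run(
    ["./fast_par_parser", "CO2_block_0042.par", "CO2_block_0042.bin"],
    check=True,
)

# Load the parsed output back into NumPy arrays on the Python side.
columns = np.fromfile("CO2_block_0042.bin", dtype=np.float64)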
I described how I built the parser in detail in this blog post: building-a-super-fast-simd-parser.
The End
This is the end of a wonderful journey, and there is a lot that could still be done to make Radis and the work I did even better, such as supporting other build systems for fast parsing or replacing the parser with a platform-independent one, since the current one does not support ARM. As a result, ARM users have to fall back to the optimized Python parser, which is fast but could be faster. That’s something for next year; I would love to mentor someone to make this happen. Yes, this summer was interesting. I hope I get to work on cool stuff like this again!