Google Summer of Code: Final Submission

Google Summer of Code @ OpenAstronomy

I have completed my Google Summer of Code project, “Fast Parsing of Large Databases and Execution Bottlenecks,” on the Radis project under the OpenAstronomy umbrella. The project developed high-performance, line-by-line parsing for high-resolution infrared molecular spectra, and this is the final blog post documenting the results and lessons learned. Most important of all, I loved the work I was able to do; it was interesting, and it was made possible by my helpful and amazing mentors. I am incredibly grateful to Dr. Nicolas Minesi, Dr. Dirk van den Bekerom and Tran Huu Nhat Huy.

Project Description

The problem is that the HITEMP CO₂ spectroscopic database is extremely large and inefficient to work with. The distributed file is about 6 GB when compressed but expands to roughly 50 GB, and the existing workflow requires fully decompressing and then batch-parsing the entire file, an operation that takes on the order of 2.5 hours and uses a lot of disk I/O, memory, and network bandwidth. That full-decompress/parse approach creates several practical bottlenecks: it forces users to have large, fast storage and long processing windows, makes quick exploratory analysis or iterative development impractical, and wastes bandwidth and time when only small portions of the data are needed. In short, the dataset’s size and the parser’s all-or-nothing design prevent efficient, selective access and slow down every downstream analysis that depends on these spectral lines.

Project Walk Through

After discussing the project, we divided it into three parts. The first part was optimizing the existing code so that, at a minimum, we would have a better working infrastructure. The second part was enabling partial downloads, so a user can retrieve only the necessary part of the file without downloading it entirely. The last part was building a C++ Single Instruction, Multiple Data (SIMD) parser using Intel intrinsics. Below I describe each of these in detail.


Building a super-fast SIMD parser for dataset - The final episode

Welcome to the last episode of my Google Summer of Code series. In the previous post I showed how I could seek inside a large .bz2 file and decompress a region to get about 500 megabytes of raw data. That worked, but it still required downloading the full 6 gigabyte compressed file up front. After talking with the maintainers, we switched to a partial-download approach: a user requests a region, and the system downloads only the 45–65 megabytes of compressed bytes that decompress to the exact 500 megabyte window we need. That change took some extra work, but it makes the system feel immediate for new users: you can get hundreds of thousands of parsed rows in a couple of minutes without pulling the whole archive.
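To make the partial-download idea concrete, here is a minimal sketch in Python. It is not the code used in Radis. It assumes a simplified archive in which every chunk is compressed as its own independent bz2 stream and a small index records the byte range of each stream. A real .bz2 file is usually a single stream whose blocks are bit-aligned, so a production version needs a finer block-level index. The names `build_archive`, `fetch_range` and `read_chunk` are hypothetical, and `fetch_range` reads from an in-memory buffer where a real client would send an HTTP range request.

```python
import bz2
import io


def build_archive(chunks):
    """Compress each chunk as an independent bz2 stream and index its byte range."""
    blob = bytearray()
    index = []  # one (start_byte, end_byte) pair per compressed stream
    for chunk in chunks:
        compressed = bz2.compress(chunk)
        index.append((len(blob), len(blob) + len(compressed)))
        blob += compressed
    return bytes(blob), index


def fetch_range(source, start, end):
    """Stand-in for a ranged download: read only bytes [start, end)."""
    source.seek(start)
    return source.read(end - start)


def read_chunk(source, index, i):
    """Fetch the compressed bytes of chunk i and decompress just that window."""
    start, end = index[i]
    return bz2.decompress(fetch_range(source, start, end))


# Toy data standing in for windows of the raw database file.
chunks = [f"line {i}\n".encode() * 1000 for i in range(4)]
archive, index = build_archive(chunks)

remote = io.BytesIO(archive)  # pretend this buffer lives on a server
window = read_chunk(remote, index, 2)
```

Only the bytes belonging to chunk 2 are read from `remote`; the rest of the archive is never touched, which is the property that lets a user skip the full download.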

When building a parser that aims to beat Pandas’ vectorized operations, single-threaded concurrency isn’t enough. Concurrency is about handling multiple tasks by rapidly switching between them on a single core. It gives the illusion of things happening in parallel, but at any given instant only one task is actually running. That’s why it feels like multitasking in everyday life where you’re switching back and forth, but you’re not truly doing two things at the same time.

True parallelism, on the other hand, means that independent pieces of work literally run simultaneously. With multiprocessing, tasks are divided across multiple cores, and each one makes progress without waiting for the others to finish. With SIMD vectorization, a single instruction operates on several data elements at once inside one core. Both are powerful for workloads like parsing large datasets, because each line can be processed independently of the others.
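As a rough illustration of the multi-core side of this, the sketch below splits a list of fixed-width text lines into chunks and parses each chunk in a separate process. It shows process-level parallelism in Python, not the SIMD intrinsics used in the C++ parser. The column layout is made up for the example and is not the real database format, and `parse_line`, `parse_chunk`, `split` and `parse_parallel` are hypothetical names.

```python
from concurrent.futures import ProcessPoolExecutor


# Hypothetical fixed-width layout, for illustration only:
# columns 0-11 hold a wavenumber, columns 12-21 hold an intensity.
def parse_line(line):
    """Parse one fixed-width line into a (wavenumber, intensity) tuple."""
    return float(line[0:12]), float(line[12:22])


def parse_chunk(lines):
    """Parse a batch of lines; each batch is independent of the others."""
    return [parse_line(line) for line in lines]


def split(lines, n):
    """Split lines into at most n contiguous chunks of near-equal size."""
    size = max(1, -(-len(lines) // n))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def parse_parallel(lines, workers=4):
    """Parse chunks in separate processes so they can run on separate cores."""
    chunks = split(lines, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        parsed = pool.map(parse_chunk, chunks)  # results keep chunk order
        return [row for chunk in parsed for row in chunk]
```

When calling `parse_parallel` from a script, place the call under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. For small inputs the cost of starting processes outweighs the gain; the approach only pays off when each chunk carries a substantial amount of work.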
