Building a super-fast SIMD parser for datasets - The final episode

Welcome to the last episode of my Google Summer of Code series. In the previous post I showed how I could seek inside a large .bz2 file and decompress a region to get about 500 megabytes of raw data. That worked, but it still required downloading the full 6 gigabyte compressed file up front. After talking with the maintainers we switched to a partial-download approach: a user requests a region and the system downloads only the 45–65 megabytes of compressed bytes that decompress to the exact 500 megabyte window we need. That change took some extra work, but it makes the system feel immediate for new users: you can get hundreds of thousands of parsed rows in a couple of minutes without pulling the whole archive.
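Under the hood, partial downloads of this kind are typically done with HTTP Range requests. Here is a minimal Python sketch of the idea; the URL, byte offsets, and function name are placeholders, not the project’s actual code:

import urllib.request

def fetch_compressed_range(url, start, end):
    # Ask the server for only the compressed bytes covering our window.
    # This needs a server that honors Range headers (206 Partial Content).
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Hypothetical usage: pull ~50 MB of compressed bytes from inside the archive.
# blob = fetch_compressed_range("https://example.org/co2.bz2", 2_000_000_000, 2_050_000_000)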

When building a parser that aims to beat Pandas’ vectorized operations, single-threaded concurrency isn’t enough. Concurrency is about handling multiple tasks by rapidly switching between them on a single core. It gives the illusion of things happening in parallel, but at any given instant only one task is actually running. It’s like multitasking in everyday life: you switch back and forth, but you’re never truly doing two things at the same time.

True parallelism, on the other hand, is about dividing independent work across multiple cores so that tasks literally run simultaneously. Each task makes progress without waiting for others to finish, which is what makes SIMD vectorization or multiprocessing so powerful for workloads like parsing large datasets.
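To make the distinction concrete, here is a minimal Python sketch (not the project’s actual parser) that splits raw text into independent chunks and parses them on separate cores:

from multiprocessing import Pool

def parse_chunk(chunk):
    # Stand-in for real parsing work: split each line into fields.
    return [line.split() for line in chunk.splitlines()]

def parse_parallel(text, n_workers=4):
    # Divide the input into independent chunks, one per worker,
    # so the parsing genuinely runs simultaneously on multiple cores.
    lines = text.splitlines(keepends=True)
    step = max(1, len(lines) // n_workers)
    chunks = ["".join(lines[i:i + step]) for i in range(0, len(lines), step)]
    with Pool(n_workers) as pool:
        parsed = pool.map(parse_chunk, chunks)
    return [row for part in parsed for row in part]

if __name__ == "__main__":
    print(parse_parallel("1 2 3\n4 5 6\n7 8 9\n", n_workers=2))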

Read more…

RADIS Web App Final GSoC Blog

In this final update, I’ll be covering the latest improvements.

Option to run the backend locally with Docker:

  • Simply follow the steps in the app to get started.
  • Improves performance if your local machine is powerful.
  • You can even mount your existing databases (if you’ve used the RADIS package) by running the container with the following command:
docker run -d -p 8080:8080 \
-v /home/mohy/.radisdb:/root/.radisdb \
-v /home/mohy/radis.json:/root/radis.json \
radis-app-backend

Moved to SpectrumFactory instead of the simpler calc_spectrum:

  • This enables GPU-based calculations if implemented in the future.
  • Also improves ExoMol database performance by avoiding unnecessary broadening file downloads (see the sketch below).
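For readers unfamiliar with the two APIs, the switch looks roughly like this. This is a sketch based on RADIS’s documented interface; all parameter values are illustrative:

from radis import SpectrumFactory

# Unlike a one-shot calc_spectrum call, the factory holds on to the
# loaded databank, so repeated calculations don't redo the setup.
sf = SpectrumFactory(
    wavenum_min=1900,     # cm-1, illustrative range
    wavenum_max=2300,
    molecule="CO",
    isotope="1",
    pressure=1.01325,     # bar
    path_length=1,        # cm
    mole_fraction=0.1,
)
sf.fetch_databank("hitran")
s = sf.eq_spectrum(Tgas=700)  # K
s.plot("radiance_noslit")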

UI improvements:

  • Better layout and space utilization to make the graph display larger and clearer.

There is a new option to choose between using all isotopes or only the first one.

Testing

I’ve added tests for all the new features on both the frontend and backend, and ensured coverage of the core existing functionalities.

Overall, it has been a fantastic journey working on this project with such a supportive community and mentors.

Read more…

We’re onto something.

I know I’m a bit late with this update, but I’ve been deep in the process of making everything work. The pieces we’ve been putting together since the beginning of the project are finally falling into place.

Since my last blog post, I resolved the broken data flow during the parsing of the states file. Now, all the necessary data flows cleanly through to the spectrum calculation step.

I also began getting some calculated values, but something was off. Instead of showing normalized populations per electronic state, it was printing partition function values. Once corrected, the values made sense and were properly normalized.
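To make the distinction concrete: the partition function is the normalizing sum, while each population is that state’s Boltzmann term divided by the sum. A minimal sketch, with made-up degeneracies and energies:

import numpy as np

K_B = 0.6950348  # Boltzmann constant in cm-1/K

def populations(g, E, T):
    # Boltzmann terms for levels with degeneracies g and energies E (cm-1).
    boltz = g * np.exp(-E / (K_B * T))
    Q = boltz.sum()     # the partition function is just this sum...
    return boltz / Q    # ...while populations are the normalized terms

g = np.array([1.0, 3.0])      # toy electronic states
E = np.array([0.0, 5000.0])
pops = populations(g, E, T=1500.0)
assert np.isclose(pops.sum(), 1.0)  # normalized populations sum to 1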

Read more…

Benchmarking Partial Decompression

Hello everyone, and welcome back to another episode of my Google Summer of Code project series! In my previous post, I introduced a clever partial decompression mechanism. If you haven’t had a chance to read it yet, check it out here. I promise it’s worth your time. In this post, I’ll share the benchmarks and findings for the new functionality.

Both the previous and new implementations require downloading the full 6 GB file before processing, so download time is excluded from all benchmarks below. In the original approach, after downloading, the entire file is parsed into a DataFrame and stored in HDF5 (.h5) format. As expected, this process is painfully slow when you only need to extract, say, 2.5 GB out of the 50 GB of data in the decompressed stream. This is where my partial decompression logic shines.
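Schematically, the old flow looks like this; the filenames and column layout are placeholders rather than the real dataset’s schema:

import bz2
import pandas as pd

# Old approach: decompress the entire archive, parse every record,
# then persist the full DataFrame as HDF5.
with bz2.open("CO2-dataset.bz2", "rt") as fh:       # hypothetical filename
    df = pd.read_csv(fh, sep=r"\s+", header=None)   # parses all 50 GB of text
df.to_hdf("CO2-dataset.h5", key="data", mode="w")   # requires PyTables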

Parsing 2.5 GB Without Cache

Below is the comparison of parsing 2.5 GB of data (without cache) using the old mechanism versus my new solution:

Read more…

RECIPES RECIPES WHAT KIND OF RECIPES? :)


Exploring Light Curve Plotting in Stingray.jl: Recipes and Examples

Hello everyone! Continuing my work with Stingray.jl, let’s learn about plotting, my favorite topic :)

Light curves are essential in high-energy astrophysics, as they represent the brightness of an astronomical object as a function of time. Precise visualization and filtering of these curves help astronomers perform accurate timing analysis, detect variability, and identify astrophysical phenomena.

This post demonstrates how to generate and customize light curve plots using Stingray.jl, leveraging real NICER datasets.
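As a conceptual warm-up before the Stingray.jl recipes, here is what binning events into a light curve amounts to, sketched in plain Python/NumPy rather than Stingray.jl’s API, with fake event times:

import numpy as np
import matplotlib.pyplot as plt

events = np.sort(np.random.uniform(0.0, 100.0, size=5000))  # fake arrival times (s)
dt = 1.0                                                    # bin width (s)
edges = np.arange(0.0, 100.0 + dt, dt)
counts, _ = np.histogram(events, bins=edges)                # counts per time bin

plt.step(edges[:-1], counts, where="post")
plt.xlabel("Time (s)")
plt.ylabel("Counts / bin")
plt.title("Light curve from binned events")
plt.show()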

Read more…

Fitting Feature Now Available in the RADIS App

The main goal of fitting is to find the best values for unknown parameters (like temperature Tgas, mole fraction, etc.) that make your theoretical (simulated) spectrum match the experimental data as closely as possible.
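In code terms, fitting boils down to minimizing a residual over those parameters. A schematic sketch follows; the simulate function and its signature are hypothetical, not the app’s implementation:

import numpy as np
from scipy.optimize import minimize

def fit_spectrum(simulate, w, observed, x0):
    # simulate(w, Tgas, mole_fraction) -> simulated intensities (hypothetical).
    # Minimize the squared mismatch between simulation and experiment.
    def cost(params):
        Tgas, mole_fraction = params
        return np.sum((simulate(w, Tgas, mole_fraction) - observed) ** 2)
    return minimize(cost, x0, method="Nelder-Mead")

# Hypothetical usage:
# result = fit_spectrum(my_model, wavenumbers, measured, x0=[700.0, 0.1])
# best_Tgas, best_mole_fraction = result.x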

This feature was not previously available in the app, but it is now (once the PR gets merged).

To use it, activate Fit Spectrum Mode from the header to open the fitting form.

Read more…

Seeking Fast at Any Point in a BZ2 Compressed File

Hey everyone, welcome to the second episode of my Google Summer of Code project series, where I’m working on partial decompression for large datasets.

So, what’s the big catch here? Well, the CO₂ dataset I’m working with is about 6 GB in its compressed .bz2 form, and when you decompress it, it explodes into 50 GB. Most systems struggle to load that much data into memory or parse it into a DataFrame, running into storage, memory, or swap limits.

And obviously not everyone wants the whole 50 GB anyway. Usually, people need just a 1 GB chunk from somewhere inside. So decompressing the entire thing just to fetch a small part is a massive waste of time and resources.
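To see why, here is the baseline the standard library leaves you with: stream-decompress from the very beginning and throw bytes away until you reach the region you want. A sketch; the chunk size is arbitrary:

import bz2

def read_window(path, offset, size, chunk=1 << 20):
    # Standard-library baseline: a bz2 stream has no random access, so
    # reaching `offset` means decompressing and discarding everything
    # before it.
    with bz2.open(path, "rb") as f:
        remaining = offset
        while remaining > 0:
            skipped = f.read(min(chunk, remaining))
            if not skipped:
                raise EOFError("offset is beyond the decompressed stream")
            remaining -= len(skipped)
        return f.read(size)

# e.g. read_window("co2.bz2", offset=40 * 2**30, size=2**30) would first
# decompress and discard ~40 GB; that is exactly the cost seeking avoids.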

Read more…

Things Are Starting to Come Together

The past couple of weeks have been really productive. After the initial planning and community bonding period, I’ve finally started working on the actual implementation. The transition from understanding the theory to getting hands-on with the code has been challenging but rewarding.

After writing my last blog post and going through the codebase more thoroughly, I discovered there are several existing classes and methods that can be reused for this project. This has led to a revised approach that builds on what’s already working well in RADIS.

Revised Approach

The core approach is the same as originally planned, but now the rovibrational populations are calculated using the existing RovibParFuncCalculator. This lets me reuse code while adding the electronic state functionality we need.
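The extension can be pictured as weighting each electronic state’s rovibrational partition function by its own Boltzmann factor and summing. A schematic sketch; the data layout is hypothetical and not RovibParFuncCalculator’s actual interface:

import numpy as np

K_B = 0.6950348  # Boltzmann constant in cm-1/K

def total_partition_function(states, T):
    # states: (g_e, Te, Q_rovib) per electronic state, where g_e is the
    # electronic degeneracy, Te the term energy (cm-1), and Q_rovib the
    # rovibrational partition function at temperature T (K).
    return sum(g * np.exp(-Te / (K_B * T)) * Q for g, Te, Q in states)

# Toy example with a ground state and one excited electronic state:
states = [(1, 0.0, 150.0), (3, 45000.0, 120.0)]
print(total_partition_function(states, T=3000.0))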

Read more…

✨Good Time, Bad Time: GTI/BTI :)

In my continued journey with Stingray.jl during GSoC 2025, this phase focused on a core aspect of high-energy astrophysics: time filtering using GTIs (Good Time Intervals) and BTIs (Bad Time Intervals). After a productive discussion with my mentor @matteobachetti during our meeting, I dove into implementing and refining functionality around GTIs, an essential tool in the timing analysis of astrophysical data.
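Conceptually, applying GTIs is a mask over event timestamps. Here is an illustrative Python/NumPy sketch of the idea, not Stingray.jl’s implementation:

import numpy as np

def apply_gtis(times, gtis):
    # Keep only events whose timestamps fall inside a Good Time Interval.
    # times: 1-D array of event times; gtis: iterable of (start, stop) pairs.
    mask = np.zeros(times.shape, dtype=bool)
    for start, stop in gtis:
        mask |= (times >= start) & (times < stop)
    return times[mask]

events = np.array([1.0, 2.5, 4.0, 7.2, 9.9])
gtis = [(0.0, 3.0), (6.0, 8.0)]
print(apply_gtis(events, gtis))  # -> [1.  2.5 7.2]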