Benchmarking Partial Decompression
Hello everyone, and welcome back to another episode of my Google Summer of Code project series! In my previous post, I introduced a clever partial decompression mechanism. If you haven’t had a chance to read it yet, check it out here. I promise it’s worth your time. In this post, I’ll share the benchmarks and findings for the new functionality.
Both the previous and new implementations require downloading the full 6 GB file before processing, so download time is excluded from all benchmarks below. In the original approach, after downloading, the entire file is parsed into a DataFrame and stored in HDF5 format. As expected, this process is painfully slow when you only need, say, 2.5 GB out of the 50 GB of data in the decompressed stream. This is where my partial decompression logic shines.
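To make that contrast concrete, here is a minimal sketch of the partial-read idea. This is not the project's actual code: it assumes a gzip-wrapped stream, uses a made-up helper name, and simply stops inflating once the requested window has been produced, so the rest of the roughly 50 GB is never decompressed or parsed. The real mechanism from the previous post goes further, but the early-exit captures the core saving.

```python
import zlib

CHUNK = 1 << 20  # read the compressed file in 1 MiB chunks


def read_decompressed_range(path, start, length):
    """Return `length` bytes of the decompressed stream beginning at
    decompressed offset `start`.

    Simplified illustration only: it still walks the compressed stream from
    the beginning, but it stops as soon as the requested window is covered,
    so everything after the window is never inflated or parsed.
    """
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # gzip framing assumed
    produced = 0                                        # decompressed bytes seen so far
    out = bytearray()
    with open(path, "rb") as f:
        while not d.eof:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            data = d.decompress(chunk)
            # Keep only the part of this chunk that falls inside the window.
            lo = max(start - produced, 0)
            hi = min(start + length - produced, len(data))
            if hi > lo:
                out += data[lo:hi]
            produced += len(data)
            if produced >= start + length:  # window covered: stop early
                break
    return bytes(out)
```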
Parsing 2.5 GB Without Cache
Below is the comparison of parsing 2.5 GB of data (without cache) using the old mechanism versus my new solution:

Parsing 2.5 GB With Vectorized Optimizations
In the above benchmark, parsing 2.5 GB of data with the new implementation and the original regex-based slicing took around 6.5 minutes. By replacing those expensive regex operations with a more efficient index-based slicing approach, I brought that time down to just 4.55 minutes, nearly a 30% improvement.
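To illustrate the kind of change involved (the actual record format differs, and the field boundaries below are invented for the example), here is a hedged sketch contrasting a per-line regex with plain index-based slicing:

```python
import re

# Hypothetical fixed-width layout for illustration:
#   chars 0-7 -> id, 8-23 -> timestamp, 24-35 -> value
LINE_RE = re.compile(r"^(.{8})(.{16})(.{12})$")


def parse_with_regex(lines):
    """Original style: run a regex on every line and pull out the groups."""
    rows = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            rows.append((m.group(1), m.group(2), m.group(3)))
    return rows


def parse_with_slices(lines):
    """Index-based slicing: the field boundaries are already known, so plain
    string slicing replaces the regex engine entirely."""
    rows = []
    for line in lines:
        if len(line) >= 36:
            rows.append((line[0:8], line[8:24], line[24:36]))
    return rows
```

Because the slicing version does no pattern matching per line, the per-record cost drops to a handful of substring copies, which is where the roughly 30% saving comes from.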

Parsing 2.5 GB With Cache
Next, I measured the performance of parsing 2.5 GB of data when reusing the cached files generated during the initial parse. Thanks to caching, subsequent reads avoid decompressing and re-indexing the same regions, yielding dramatically faster turnaround for typical query sizes. In our tests, parsing 2.5 GB from the cache dropped from 4.55 minutes to just 1.2 minutes, an almost 75% improvement over the uncached run. Of course, if a user's workload involves scanning the entire 50 GB dataset in a single pass, the relative gains from caching diminish significantly; I'll cover end-to-end full-file performance in my next post.
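As a rough illustration of the caching pattern (the function, key, and file names here are hypothetical, not the project's actual layout), a cache-aware loader can check for a previously parsed slice on disk before doing any decompression or parsing at all:

```python
import os

import pandas as pd


def load_range(start, length, compute, cache_dir="cache"):
    """Return the parsed DataFrame for the requested decompressed window,
    reusing a cached copy on disk when one exists.

    `compute` is whatever callable produces the DataFrame on a cache miss,
    e.g. partial decompression plus parsing as sketched earlier.
    """
    os.makedirs(cache_dir, exist_ok=True)
    cache_file = os.path.join(cache_dir, f"slice_{start}_{length}.h5")
    if os.path.exists(cache_file):
        # Cache hit: skip decompression and parsing entirely.
        return pd.read_hdf(cache_file, key="data")
    # Cache miss: do the expensive work once, then persist the result.
    df = compute(start, length)
    df.to_hdf(cache_file, key="data", mode="w")
    return df
```

On a second request for the same window, only the HDF5 read remains, which is why the cached run finishes in a fraction of the uncached time.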

The next major update will leverage C++ SIMD (Single Instruction, Multiple Data) intrinsics to accelerate parsing at the assembly level. That optimization will push performance even further, and I'll share the details and code examples in my next blog post. As always, thank you for reading, and stay tuned!