Results are out!!!

Om Biradar — Sun, 14 Jun 2026 11:08:19 GMT

My GSoC project's main goals includes parallelizing the reading of structs in parquet files. This is very beneficial to the community as astronomical data involves data stored in structs and other nested structures.

This figure shows the performance increase/decrease the optimized parquet reader (which I made) has over the baseline main branch of apache/arrow when run on a standard free GitHub runner and the file in loaded into the RAM to remove the I/O overhead which is not relevant here.

For smaller files, the overhead of multi threading causes it to be slower, but for real life cases when the file sizes are large, the optimized reader now provides 25%-30%+ faster reading times.

When compared to flat parquet files, the nested struct now offer similar read speeds with multi threading enabled!!

Overally, the project is going at a great pace with verifiable results. I hope this integrates with the upstream apache arrow library soon so that this performance boost can help the people working with PyArrow on astronomy datasets.

The link to the PR - https://github.com/apache/arrow/pull/50158

Link to the results colab notebook - https://colab.research.google.com/drive/1TsxFkBSI_Iq0hfXEwNDs_D24acr3yxdC?usp=sharing

Orchestrating and benchmarking repo - https://github.com/OmBiradar/pyarrow-lincc-fw-openastronomy-gsoc26

Community Bonding and week 1 & 2

Om Biradar — Tue, 09 Jun 2026 02:27:42 GMT

The start was amazing! The community bonding was really great. I got to meet the mentors, get to know the whole organization structure of OpenAstronomy and LINCC frameworks, the work they do, the people and facilities associated with it, the ways PyArrow and nested-pandas was being used in astronomy and the expectations they had form the internship. They offered to help me through tasks if I could not do it on my own along with some content to look through to get a deeper understanding of the project. I attended the Apache Arrow community meeting with my mentor to introduce the project them and get their views on it. They were really helpful and even suggested certain thing to do to improve the final PR.

Week 1 and 2 were spent on improving the parallel reading of parquet files, which was successfully implemented by me. The benchmarking of these performance changes proved quite hard, as this would require the following, starting with the main arrow branch:

Building arrow C++ from source
Building PyArrow from source
Running benchmarking scripts
Switching the branch from the main branch and repeating steps 1-3

The time taken to benchmark the changes for all order of magnitude of files proved to be very long, approximately ~140 hours or 6 days. I could speed this up using parallel jobs in github which needed to be configured separately using config settings and matrix github runners. This needed to be done because github sets a timeout of 6 hours on each github action with at max 20 concurrent jobs and a 24 hour auto cancel timeout on jobs in queue. Based on these constraints I had to figure out a way to orchestrate different runners and combine their results for which I used sqlite3.

Universe OpenAstronomy (Posts about lincc-fw)

Results are out!!!

Community Bonding and week 1 & 2