Finally, I have been contributing to scientific open source for about a year now and it has taught me a lot I mean a lot, I still remember searching for a simple documentation issue so I could get it merged and call myself a “contributor” haha, and that got me started in scientific open source. Since then, I have been able to make some truly meaningful contributions to many projects, and here I am writing a blog for GSoC with Radis which is a pythonic library for fast line-by-line code for high resolution infrared molecular spectra, under the OpenAstronomy umbrella.
So my project is cool ngl, and it is titled “Fast Parsing of Large Databases and Execution Bottlenecks” basically there exists a large highly compressed CO₂ spectroscopic database of size 6 GB file that decompresses to about 50 GB and takes at least 2.5 hours to parse and convert into a DataFrame. As you might expect, my project is about significantly reducing the parsing time and finding a workaround for storing only the “necessary” parts of the decompressed file.
Radis is pythonic and sometimes python gets real slow if not used the way it is meant to be used, so my initial thought was to first clean up the existing code and use vectorised operations and with numba we should be able to see some real optimisation. But then I realized the current implementation already has the right vectorized operations on DataFrames, and Pandas’ vectorized methods are already implemented in optimized C/Cython loops. So there isn’t much more to do here other than replace a few overheads with other operations. After that, as discussed with my mentors I can use a C++ Single Instruction Multiple Data (SIMD) mechanism to parse the file and create a python interface on top of that with pybind11 or cython and other option but this will cost us portability as compilation will be a thing that is to considered. Other approach which is using vulkan API in python as it supports CPU as well GPU parallelism and its cross platform as it will works on all CPU architectures.
Read more…