GSoC - 4

Radis

Today is the last day of GSoC-21. The entire journey was a rollercoaster ride and I learnt a lot of new things along the way. I started out hardly knowing any of the shortcomings of pandas, and as we dug in, I was surprised to see how many limitations it has. This will be the final blog post of my GSoC journey and I hope you like it.

Pandas and Vaex

Why do people use Pandas?

Pandas is an open source Python package that is widely used for data science, data analysis and machine learning tasks. It is built on top of another package named NumPy, which provides support for multi-dimensional arrays.

Pandas makes it simple to do many of the time consuming, repetitive tasks associated with working with data, including:

  • Data visualization
  • Statistical analysis
  • Data inspection
  • Loading and saving data
  • Data cleansing
  • Data fill
  • Data normalization
  • Merges and joins

At the time this blog was written, pandas is arguably the most popular DataFrame library among data scientists. While pandas works smoothly with smaller data, it becomes very slow and inefficient with huge datasets.

Why use Vaex?

Vaex is a Python library with an API closely similar to pandas. It is built for lazy, out-of-core DataFrames and helps to visualize and explore big tabular datasets. It is a high-performance library and can address many of the shortcomings of pandas. Because the API is similar to pandas, users can switch without much difficulty.

Vaex can calculate statistics such as the mean and standard deviation on an N-dimensional grid at up to a billion (10^9) objects/rows per second.

Pandas or Vaex?

Here at Radis, the underlying algorithm could not perform to its full capacity because of pandas, which consumes far too much memory. So we tried to see how vaex could help improve the performance.

Below we use the HITEMP-N2O database for all the performance checks. Note that there is a difference between the PyTables HDF5 files that pandas uses and vaex-friendly HDF5 files: the former are row-based, whereas the latter are column-based.
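To make the row-based versus column-based distinction concrete, here is a minimal pure-Python sketch. It is only an analogy for the two layouts, not how PyTables or vaex actually store data on disk. The column names are borrowed from the examples further down, and the values are made up for illustration.

```python
# Toy line list: three rows with four columns each (values are made up).
rows = [
    (1, 2200.1, 1.0e-20, 10.5),
    (1, 2200.4, 3.0e-21, 55.0),
    (2, 2201.0, 7.0e-22, 80.2),
]
names = ("iso", "wav", "int", "El")

# Row-based layout: each record is stored together, so reading one column
# still means walking over every full record.
wav_from_rows = [record[names.index("wav")] for record in rows]

# Column-based layout: each column is stored on its own, so a single column
# can be read without touching the others.
columns = {name: [record[i] for record in rows] for i, name in enumerate(names)}
wav_from_columns = columns["wav"]

print(wav_from_rows == wav_from_columns)
```

Both layouts hold the same data. The difference is how much has to be read to answer a column-only query.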

Loading time

In the code below we:

  1. load the vaex HDF5 file and then convert it to a pandas DataFrame
  2. directly load the pandas HDF5 file
> from time import time
> import vaex
> t0 = time()
> df = vaex.open("~/.radisdb/N2O-04_HITEMP2019.hdf5")
> df_pandas = df.to_pandas_df()
> print(time()-t0)
7.833287477493286
> t0 = time()
> import pandas as pd
> df_pandas2 = pd.read_hdf("~/.radisdb/N2O-04_HITEMP2019.h5")
> print(time()-t0)
28.142656087875366

Clearly the first approach is about 3.6 times faster than the second one.
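The `t0 = time()` pattern used in these measurements can be wrapped in a small helper. The sketch below is an assumption of mine, not part of Radis or vaex: `timed` is a hypothetical name, and the list comprehension is a stand-in workload so that the snippet runs without the database files.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = perf_counter()
    try:
        yield
    finally:
        results[label] = perf_counter() - start


timings = {}
with timed("dummy load", timings):
    # Stand-in workload; replace with vaex.open(...) or pd.read_hdf(...).
    data = [i * i for i in range(100_000)]

print(f"dummy load: {timings['dummy load']:.4f} s")
```

Doing the imports before entering the timed block keeps the one-off import cost out of the measurement, and `perf_counter` is better suited to measuring intervals than `time`.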

Load specific columns

As already stated, vaex HDF5 files are column-based, so loading only specific columns from a vaex HDF5 file should give much better results than loading only specific columns in pandas. Let's check this and see the time taken by each:

> t0 = time()
> df = vaex.open("~/.radisdb/N2O-04_HITEMP2019.hdf5")
> df_pandas = df.to_pandas_df(column_names=["iso", "wav", "int", "El"])
> print(time()-t0)
0.1795198917388916
> t0 = time()
> import pandas as pd
> df_pandas2 = pd.read_hdf("~/.radisdb/N2O-04_HITEMP2019.h5", columns=["iso", "wav", "int", "El"])
> print(time()-t0)
22.85481858253479

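Vaex loads the 4 columns in about 0.18 s, roughly 127 times faster than pandas. With pandas, loading 4 out of 19 columns still takes about 80% of the time needed to load the whole file.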
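The ratios can be checked directly from the timings printed in the two benchmarks so far. This is plain arithmetic on those numbers, and the variable names are my own.

```python
# Timings in seconds, copied from the benchmark outputs in this post.
vaex_all = 7.833287477493286
pandas_all = 28.142656087875366
vaex_4cols = 0.1795198917388916
pandas_4cols = 22.85481858253479

# How many times faster vaex is for the full load and for the 4-column load.
full_speedup = pandas_all / vaex_all
column_speedup = pandas_4cols / vaex_4cols

# Fraction of the full pandas load time that the 4-column pandas load takes.
pandas_fraction = pandas_4cols / pandas_all

print(f"full load speedup:     {full_speedup:.1f}x")
print(f"4-column load speedup: {column_speedup:.0f}x")
print(f"pandas 4 cols / all:   {pandas_fraction:.0%}")
```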

Load specific rows

To be fair to PyTables, let’s try to load specific rows and check whether pandas can now provide better performance with its row-based HDF5 files.

> t0 = time()
> import pandas as pd
> df_pandas2 = pd.read_hdf("~/.radisdb/N2O-04_HITEMP2019.h5", where="iso==1")
> print(time()-t0)
30.680099725723267
> t0 = time()
> df = vaex.open("~/.radisdb/N2O-04_HITEMP2019.hdf5")
> df.select(df.iso == 1)
> df_pandas = df.to_pandas_df(selection=True)
> print(time()-t0)
7.043155670166016

Even in this case vaex performs better, by a factor of more than 4. So the idea was to harness this efficiency of vaex for all the I/O operations on the dataset in Radis. To do this, I wrote an HDF5 writer that fetches the bz2 file and parses it into a column-major HDF5 file. The complete code of the HDF5 writer can be found in this gist.
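For readers who only want the general shape of such a writer, here is a rough sketch. It is not the code from the gist. It assumes a hypothetical, simplified input format (whitespace-separated text with the columns iso, wav, int and El), whereas the real HITEMP files use a fixed-width format. The function names `parse_bz2_to_columns` and `write_column_hdf5` are my own, and the second one needs vaex and NumPy installed.

```python
import bz2

# Hypothetical, simplified input: whitespace-separated text with the columns
# iso, wav, int, El. Real HITEMP files use a fixed-width format instead.
COLUMNS = (("iso", int), ("wav", float), ("int", float), ("El", float))


def parse_bz2_to_columns(compressed: bytes) -> dict:
    """Decompress bz2 data and collect each field into its own column list."""
    columns = {name: [] for name, _ in COLUMNS}
    text = bz2.decompress(compressed).decode("ascii")
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        for (name, convert), field in zip(COLUMNS, fields):
            columns[name].append(convert(field))
    return columns


def write_column_hdf5(columns: dict, path: str) -> None:
    """Write the columns to a vaex-friendly HDF5 file (needs vaex and numpy)."""
    import numpy as np
    import vaex

    arrays = {name: np.asarray(values) for name, values in columns.items()}
    vaex.from_arrays(**arrays).export_hdf5(path)


# Tiny in-memory sample so the parsing step runs without any download.
sample = bz2.compress(b"1 2200.1 1.0e-20 10.5\n2 2201.0 7.0e-22 80.2\n")
columns = parse_bz2_to_columns(sample)
print(columns["iso"], columns["wav"])
```

Calling `write_column_hdf5(columns, "sample.hdf5")` would then produce a file that `vaex.open` can read. A real line list is far too large to hold in Python lists, so an actual writer would have to process the file in chunks.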

That is it for GSoC-21 from my side. Even though the second phase of my project was affected by school, I had an exciting summer as a whole. I look forward to staying in touch with Radis and will try to contribute whenever I get a chance.