Universe OpenAstronomy (Posts about gnuastro)

Final GSoC Report

Labib Asari — Mon, 21 Aug 2023 23:00:00 GMT

I will be discussing the goals of my GSoC project, how I spent my time and what I learned during this period. I will also be discussing the future of my project and what I plan to do next.

Goals of my GSoC

The original goals of my Google Summer of Code project this year were to :

Redesign the error handling inside the Gnuastro C library.
Add wrappers for Gnuastro library functions in pyGnuastro.

Prior to GSoC, my experience mostly consisted of Deep Learning and Computer Vision. I had a good high-level understanding of how GPUs were leveraged for the compute intensive tasks in various libraries and frameworks in these domains. I had started exploring the lower-level abstractions over GPUs using the CUDA framework.

In the early weeks of February, I delivered a presentation to the Gnuastro development team. The presentation was a proposal outlining the integration of GPU support into Gnuastro, an idea borrowed from the Machine Learning world but with huge potential in the field of Astronomy. Both of these domains process huge amounts of data.

My mentor Mohammad Akhlaghi was very supportive of this idea and gave me the go-ahead to start working on it.

And so, we had a 3rd goal for this GSoC project :

Adding GPU support to Gnuastro.

Work Done Throughout the GSoC

Error Handling

Need : All of Gnuastro’s library functions performed error handling using error(EXIT_FAILURE, ....); thus exiting the program with a detailed error message whenever an error was encountered. This wasn’t a problem for the Gnuastro programs. However, for other callers like pyGnuastro it is problematic, as it exits from the entire Python environment.

The new error handling mechanism defines a new module, error.h, with a new data structure, gal_error_t. The exact contents of this structure have gone through multiple iterations but the final one is :

The user should define a gal_error_t before the function call and pass it as an argument to the function (every function in Gnuastro will now have an extra argument).

During the function execution, if any error occurs, it will populate the gal_error_t with the error message and the error code. The user can then check the error code and the error message to determine what went wrong.

Corresponding functions are added in error.h for writing and managing the structure. Some methods are also provided for the Python interface.
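To make the pattern concrete, here is a minimal Python sketch of "the caller owns the error state and the function fills it in instead of exiting". It is not Gnuastro’s actual gal_error_t: the names ErrorState, report and safe_divide are made up purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorState:
    """Hypothetical stand-in for an error structure owned by the caller."""
    code: int = 0                                   # 0 means "no error"
    messages: list = field(default_factory=list)    # human-readable details

    def report(self, code, message):
        # Record the failure instead of terminating the whole process.
        self.code = code
        self.messages.append(message)


def safe_divide(numerator, denominator, err):
    """Library-style function: the error state is an extra argument."""
    if denominator == 0:
        err.report(1, "safe_divide: denominator is zero")
        return None
    return numerator / denominator


err = ErrorState()                      # caller defines it before the call
result = safe_divide(1.0, 0.0, err)
if err.code != 0:                       # caller inspects it after the call
    print("error", err.code, ":", err.messages[-1])
```

The point of the design is that the decision about what to do with an error (print it, raise a Python exception, abort) moves from the library to the caller.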

After the module was finished, Mohammad implemented the new error mechanism inside the cosmology.c module, and then I used it to update the corresponding cosmology module in pyGnuastro. This solved the main problem of the Python environment exiting on any error; instead, errors are now reported inside the Python shell.

This completed the setup of the low-level infrastructure for the new error handling mechanism. It can now be used by other modules of Gnuastro to update what happens when an error occurs. Implementing the high-level error function calls, deciding the exact error type and defining what message should be shown would be best done by the original authors of the modules.

The new error handling mechanism currently lives at the Gnuastro repository.

pyGnuastro

Apart from implementing the new error handling mechanism in existing modules of pyGnuastro, I worked on 2 major things :

Implemented speclines module in pyGnuastro : this is a simple module without any complex data structures. I tried this first when I was learning about the C-Python API. It gave me a good grasp of what is going on in the existing pyGnuastro implementation.
GAL_DATA_T for Python : The core data structure of Gnuastro, gal_data_t, is a C struct. Any external data is represented using this structure, so it was crucial to have a similar structure in Python. Previously, Jash’s work on loading and saving FITS files made use of the Numpy-C API to convert the raw data inside the gal_data_t to a Numpy array. This was an extremely clever and efficient idea, however it skipped all the other details inside gal_data_t. We had to find a way to represent the entire gal_data_t in Python. The normal way to create a new data structure in Python would be to create a new class. However, the wrappers are written in C and we don’t get access to the Python interpreter. I took some more inspiration from Numpy and how it creates its core data structure, numpy.ndarray, using the C-Python API. I then discovered that the API allows us to define custom objects which may be used as data types by the Python interpreter. I learnt and used them to create a corresponding pygnuastro.data for pyGnuastro. It basically acts as a new data type in Python, similar to numpy.ndarray, that also holds the other details of gal_data_t. After this we had the details of gal_data_t in Python, but we were missing out on Jash’s idea of utilizing Numpy in pyGnuastro. I spent some time making sure we could still utilize Numpy’s speed inside pyGnuastro. The C-Python API is versatile and allows complex objects to be sub-objects of other objects. Eventually we had the array (raw data) represented as a numpy.ndarray! This meant we had both the speed of Numpy and the details of gal_data_t in pyGnuastro’s pygnuastro.data. This was a major milestone in pyGnuastro.
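As a rough picture of the resulting layout (metadata plus a separate array sub-object), here is a small pure-Python sketch. In pyGnuastro the type is defined through the C-Python API, not as a Python class, and the raw data is a numpy.ndarray; the class name Data, its field names and the use of the standard-library array module are all assumptions made only so the example runs on its own.

```python
from array import array
from dataclasses import dataclass


@dataclass
class Data:
    """Hypothetical container: metadata plus an array held as a sub-object."""
    values: array          # raw pixel buffer (stands in for numpy.ndarray)
    dsize: tuple           # length along each dimension
    name: str = ""
    unit: str = ""

    @property
    def ndim(self):
        # Number of dimensions follows from the shape.
        return len(self.dsize)

    @property
    def size(self):
        # Total number of elements is the product of the dimension lengths.
        total = 1
        for n in self.dsize:
            total *= n
        return total


image = Data(values=array("f", [0.0] * 6), dsize=(2, 3),
             name="demo", unit="counts")
print(image.ndim, image.size, image.unit)
```

Keeping the array as its own object is what lets fast array code work on the raw data while the surrounding object carries the remaining details.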

GPUs in Gnuastro

Gnuastro is an astronomical data analysis and manipulation library. Astronomical data is usually very large in size, and thus computationally intensive. If the operations performed on this data are parallelizable, then GPUs can significantly speed up the processing.

I started my work on GPUs right after Mohammad approved my initial idea. Here’s a summary/story of all the work done for GPU support :

Learning about build systems : After the GPU support idea was accepted, my mentor suggested we should first set up the build system so CUDA modules can be integrated smoothly in the future. Gnuastro uses Autotools for its build system. I started by learning about autoconf, automake and libtool.
Linking Gnuastro with CUDA runtime : The CUDA SDK provides a runtime library, cudart, which is the necessary component to initiate communication with the GPU drivers. The runtime library is distributed as both a static and a shared object file. This made things easier, as we could link the runtime library statically with the Gnuastro library, making cudart part of Gnuastro. I modified the configure script to link the runtime library statically with Gnuastro. This was also the time I learnt extensively about how low-level system libraries are built, linked and distributed.
Struggling with Libtool : I then tried to implement some simple matrix functions in CUDA and integrate them with Gnuastro. CUDA source code is compiled by the nvcc compiler. However, during linking, libtool assumes that all source files are compiled by gcc, so it ignored all the CUDA source files. After I wrote dedicated rules for CUDA source compilation in the Makefile, the CUDA source was getting compiled, but it was not being linked into Gnuastro. Libtool only links files that have a corresponding libtool object (.lo file), and these are created by libtool for each source file it handles (which in our case were the gcc-compiled files).
AutoMake developers rescuing us : After I had struggled with libtool for a few days, my mentor suggested that I contact the AutoMake developers to seek some help. I mailed them a small demonstration of what I was trying to do and waited for their response. After a few days, I received a reply from them. The fix was actually simple: automake has a special variable (LD_ADD) which directly communicates with the GNU linker (ld), and I just had to add the CUDA object files to this variable. It worked and we finally had a working CUDA module in Gnuastro which used the GPU for execution!

It was around the 1st week of April by now. I made my final proposal submission and had my fingers crossed for getting selected in GSoC.

As mentioned in the GSoC proposal, we had to first focus on the Error handling and Python wrappers, so I started working on these two goals (I was also indeed selected for GSoC in the meantime!).

Convolutions on GPU : After getting back to working with GPUs around June-July, I started with implementing the convolution function in CUDA. Convolution is a direct operation as well as a subroutine to other operations in Gnuastro. The results of CUDA convolution were remarkable. We got up to 400x speed-up on the convolution operation! Since the speed-up was so significant, my mentor suggested that I should prioritise getting more of the GPU work done. Read more about Convolution on GPU in my blog here.
Adapting OpenCL : CUDA is a proprietary framework by Nvidia. It only works on Nvidia GPUs. We wanted to make Gnuastro GPU support available to all users, irrespective of the GPU they have. This is where OpenCL comes in. OpenCL is an open standard for parallel programming of heterogeneous systems. It is supported by all major GPU vendors. I started learning about OpenCL and how it works at a low level. I also started learning about the OpenCL C99 programming standard. Read more about starting with OpenCL in my blog here.
Integrating OpenCL : OpenCL was initially hard to learn, but I managed to integrate it with Gnuastro right before my GSoC’s official timeline was about to end! I have a pretty detailed blog on the entire integration process here.
Same code on CPU and GPU : After we had success with OpenCL, my mentor recommended we should try executing the exact same code on CPU and GPU, to show the concept of executing the same instructions on both processors and seeing the speed-up on GPUs. As far as we knew, this had never been done in the field of Astronomy, so it would be a great demonstration. This was quite challenging, as GPUs are programmed with different frameworks and need some extra components in the code for management. In Machine Learning frameworks, the GPU and CPU modules are generally written separately (in fact, TensorFlow used to have a different package altogether for GPU until 2.0). However, the good part is that most of the GPU frameworks are derived from the C/C++ languages. I spent my last week of GSoC trying to implement the core logic in a macro which is shared by both the OpenCL kernels and the C library, and had success. This can be accessed here.

Future : The future of this project is very bright. I have already set up the bare-bones GPU integration, and I’ll continue to add GPU modules building upon it. We have a working OpenCL integration. We have a working CUDA integration. We have working CPU-GPU code sharing. I mentioned certain challenges we are currently facing in my OpenCL integration blog. I’ll continue to look for solutions to them and to add support for further modules on GPU.

Acknowledgements

GSoC has been a great learning experience for me. I’m extremely grateful to everyone who was part of this journey.

I would like to thank my mentor Mohammad Akhlaghi for his constant support and guidance throughout the project. He has been very patient right from the beginning and believed in me when I did not have a clear idea of how I’d approach all the goals. He allowed me to work at my own pace, to explore and learn things as needed, and has always pulled me out of the rabbit hole whenever I got stuck. Every time I join a meeting with him, I learn something new. I’m very grateful to him for giving me the opportunity to work on this project.

I am sincerely thankful to Jash Shah for introducing me to the Gnuastro development team and walking me through the existing work on error handling and pyGnuastro. It gave me a huge boost and was extremely valuable. He’s always been attentive to my small queries and has supported me through multiple challenges. In general, I’m very grateful to have him as a mentor and friend.

I would also like to thank the Gnuastro development team for their support and feedback throughout the project. It’s been such a wonderful time working with them. I have learnt a ton from following Pedram’s work on adding SQL to Gnuastro, Fathma’s work on TIFF files and the Curl library, and Faezeh’s work on implementing Convolutional Neural Networks in Gnuastro. They’ve always been crucial in providing feedback and suggestions on my work. I am genuinely grateful for the opportunity to collaborate with such a talented and committed group, and I look forward to working and growing with them in the future.

I would also like to thank the Google Summer of Code team for taking the wonderful initiative and giving me this opportunity to work on this project.

Integrating OpenCL with Gnuastro

Labib Asari — Fri, 11 Aug 2023 23:00:00 GMT

Background

In the last post, I discussed what OpenCL is and why we chose to integrate it with Gnuastro. In this post, I’ll be discussing the actual implementation and the challenges I faced.

Programming in OpenCL

The OpenCL 3.0 standard has done a great job of simplifying the programming model. The OpenCL 3.0 API is a header-only library that provides a modern, object-oriented interface to the OpenCL runtime. It is designed to be easy to use and provides an abstraction of the OpenCL runtime, making it easier to write portable code across different OpenCL implementations. We still have to communicate with the driver at a low level (unlike CUDA), but this becomes a mandatory step when we want to run our code on different hardware (CUDA always expects an NVIDIA device).

Here’s a general overview of the steps to be followed when writing a program using OpenCL :

Check for available Platforms : A platform is a collection of OpenCL devices. A platform can be a CPU, GPU, or an FPGA (Remember OpenCL can work with any platform!). This is done specifically to identify which OpenCL implementation will be used during runtime. We can query the system for available platforms using the clGetPlatformIDs function. This function returns a list of platforms available on the system.
Check for available devices : A device is a physical device that can execute OpenCL kernels. A device can be a CPU, GPU, or an FPGA. We can query the system for available devices using the clGetDeviceIDs function. This function returns a list of devices available on the system.
Create a context : A context is a container for all the OpenCL objects. It is used to manage the memory, command queues, and other OpenCL objects. It is created by passing a list of devices to the constructor. Since OpenCL can work with multiple devices, we can create a context with multiple devices. This is useful when we want to run our code on multiple devices at the same time.
Create a command queue : A command queue is used to queue up commands for the device to execute. By default, the device executes the commands in the order they are received. The commands can be kernel executions, memory transfers, or any other OpenCL command. We can also create multiple command queues, which is useful when we want to run multiple commands independently. Enqueueing a command in OpenCL is asynchronous by default: the command is queued up and control returns to the host immediately, so the host can continue with other tasks. When we need synchronous behaviour, the host can wait explicitly, for example with clFinish or a blocking read or write, so that control returns only when the commands have been executed.
Load the Kernel : A kernel is a function that is executed on the device. It is written as per the C99 standard. We can load the kernel from a file or we can write the kernel inline. To maintain portability, OpenCL kernels are generally compiled at runtime using clBuildProgram. We can also compile the kernel offline. This is useful when we want to compile the kernel for a specific device.
Copy Data to device memory : All the data used in kernel, must be on the device memory. So we have to copy the data from the host to the device memory. We can do this using the clCreateBuffer function. This function creates a buffer on the device memory. We can then copy the data from the host to the device using the clEnqueueWriteBuffer function. This function copies the data from the host to the device.
Launch the kernel : We launch the kernel by enqueueing the kernel object on the command queue. The arguments for the kernel have to be set separately, using the clSetKernelArg function. We also set the global and local work size. The global work size is the total number of work-items that will be executed. The local work size is the number of work-items that will be executed in a work-group. The global work size should be a multiple of the local work size. OpenCL does not round it up for us (in OpenCL 1.x the launch fails with CL_INVALID_WORK_GROUP_SIZE), so the host code usually rounds the global work size up to the next multiple of the local work size and the kernel skips the extra work-items with a bounds check.
Read the output : We can read the output from the device using the clEnqueueReadBuffer function. This function copies the data from the device to the host.
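The work-size bookkeeping in the launch step is easy to get wrong, so here is a small Python sketch of one common convention: round the global size up to a multiple of the local size on the host, and let a bounds check skip the padded work-items. It only simulates the arithmetic (no OpenCL is involved), and the function names are made up for illustration.

```python
def round_up_global_size(n_items, local_size):
    """Smallest multiple of local_size that is >= n_items."""
    if n_items < 0 or local_size <= 0:
        raise ValueError("n_items must be >= 0 and local_size must be > 0")
    return ((n_items + local_size - 1) // local_size) * local_size


def run_guarded(n_items, local_size):
    """Simulate a launch: padded work-items skip the body via a bounds check."""
    global_size = round_up_global_size(n_items, local_size)
    touched = []
    for gid in range(global_size):      # each iteration stands for one work-item
        if gid < n_items:               # the guard a kernel would contain
            touched.append(gid)
    return global_size, touched


global_size, touched = run_guarded(1000, 64)
print(global_size, len(touched))
```

With 1000 elements and a local size of 64, 16 work-groups are needed, so 24 of the launched work-items fall outside the data and do nothing.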

Implementation

Among all the steps mentioned above, everything up till loading the kernel is common to all the programs that’ll be using OpenCL. So we defined a gpu_utils module which is responsible for querying for the available platforms and devices, creating the context and command queue, loading and compiling the kernel. The only external data it requires is the path to the kernel file. This is provided as an input. It also provides utility functions to copy specific data types to and from device memory.

There’ll be 2 types of OpenCL program in Gnuastro :

Programs using OpenCL to speed-up existing operations inside Gnuastro.
User defined OpenCL kernels, responsible for performing a custom task.

Programs using OpenCL to speed-up existing operations inside Gnuastro

These programs will be using OpenCL to speed up existing operations inside Gnuastro. For example, we can use OpenCL to speed up the astconvolve operation by passing an extra --gpu option. For these programs, the OpenCL kernels will be part of the Gnuastro Library.

The general flow of the program then becomes :

The user passes the input data for a specific operation, and also chooses the local and global work size.
The program then initializes the device using gpu_utils module by providing the kernel file from the library, which does everything and returns a cl_kernel (which is essentially the compiled kernel).
Data transfer from CPU to device (GPU) is done using the functions provided by gpu_utils module.
The kernel is launched with the provided global and local work size.
Data is copied back to CPU memory and returned to the user.

User defined OpenCL kernels, responsible for performing a custom task

These programs will be using OpenCL to perform a custom task. For example, we can use OpenCL to perform a custom convolution operation by passing a custom kernel. For these programs, the OpenCL kernels will be provided by the user. The exact design details are yet to be determined for this.

Results

Input image : 10,000 x 20,000 random image with normal distribution.
Kernel : 7 x 7 standard convolution kernel.
CPU : Intel(R) Core(TM) i5-9300HF CPU @ 2.40GHz
GPU : NVIDIA GeForce GTX 1650

Convolution using existing convolution in Gnuastro :

Convolution on OpenCL :

Result

The speed-up for the convolution operation itself ranges from 300-500x, but for the entire operation it’s around 3-5x due to the overhead of copying data to and from the device. Overcoming this is a big and important challenge!

Challenges

No GAL_DATA_T inside OpenCL kernel! : Inside OpenCL, cl_mem is the primary object used to represent memory objects such as buffers and images. It is used to allocate memory on the device. Regardless of where the data is coming from on device (arrays, structs, etc), it’s all converted into a cl_mem object when copied to the device.

However inside Gnuastro, the core data structure is gal_data_t which is essentially just a C struct.

Why is this a problem? Well, the raw data of the input image/table is not contained inside the gal_data_t. It merely contains a pointer to that data! So when we copy the gal_data_t to the device, the raw data (which is huge) is not copied. (It lives in CPU memory, and we can’t use CPU pointers in GPU memory.)

What about copying the raw data separately to the GPU memory, and then replacing the pointer inside gal_data_t with a pointer which has the address in GPU memory? Well, this is not possible either. Why? See, when we are on the CPU, we have a good gal_data_t struct which is a single big object with ‘sub-objects’ (one of which is the pointer). But on the GPU, we have a cl_mem which is an object, but unlike structs, it can’t have sub-objects!

How do we solve this? Currently all the required pointers inside gal_data_t are passed as separate arguments to the kernel. After a careful study of the internal implementation of the cl_mem object, we’ll see if we can directly pass the gal_data_t to the kernel.

Data Transfer Overhead : As mentioned multiple times, to use GPUs we must copy data to and from the GPU memory. Astronomical datasets are huge, and copying them for each operation is a big overhead! In fact, the actual operation is much, much faster than the data transfer, so much so that around 95% of the time is spent copying data to and from the GPU memory. It reduces performance by ~100x! It can’t continue this way!

One solution we’ve figured out is that when the external data is loaded for the first time in the program, we load it into the GPU memory instead of the CPU memory. This way, for each subsequent operation, we don’t have to copy the data from CPU to GPU memory. After all the operations are done, we’ll copy the result back to CPU memory and save it to the disk. This will avoid almost all the data transfer overhead.
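A back-of-the-envelope model shows why keeping the data on the device pays off once several operations are chained. This is only a sketch with a simple additive cost model, and the timings below are invented for illustration, not measurements from Gnuastro.

```python
def total_time(n_ops, t_upload, t_kernel, t_download, keep_on_device):
    """Wall time for n_ops chained operations under a simple additive model."""
    if keep_on_device:
        # Upload once, run every kernel on the device, download once.
        return t_upload + n_ops * t_kernel + t_download
    # Round trip through host memory for every single operation.
    return n_ops * (t_upload + t_kernel + t_download)


# Purely illustrative timings in seconds (not measurements).
T_UPLOAD, T_KERNEL, T_DOWNLOAD = 0.50, 0.01, 0.50

per_op = total_time(5, T_UPLOAD, T_KERNEL, T_DOWNLOAD, keep_on_device=False)
resident = total_time(5, T_UPLOAD, T_KERNEL, T_DOWNLOAD, keep_on_device=True)
print(f"per-operation copies: {per_op:.2f} s, resident data: {resident:.2f} s")
```

In this model the transfer cost is paid once instead of once per operation, so the benefit grows with the number of operations; for a single operation the two strategies cost the same.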

This is about the same approach used by Machine Learning libraries such as TensorFlow. Basically, during initialization, it occupies all the GPU memory it can, and keeps it occupied. All the operations, their results and the subsequent operations are done in the GPU memory itself.

Moving towards OpenCL

Labib Asari — Thu, 27 Jul 2023 23:00:00 GMT

Background

So far, all my work on GPUs has been using CUDA. But CUDA is proprietary to NVIDIA and only works on NVIDIA GPUs. So, I’ve been working on moving the code to OpenCL, which is an open standard for parallel programming on heterogeneous systems.

OpenCL

OpenCL(Open Computing Language) is an open standard for cross-platform, parallel programming of diverse accelerators(CPUs, GPUs, FPGAs, etc) found in supercomputers, cloud servers, personal computers, mobile devices and embedded platforms. Note the 2 key points -

open standard : this means that the specification and documentation of the technology are publicly available and can be accessed by anyone.
cross-platform : this means that it can run on multiple operating systems and hardware architectures without requiring major modifications to the code.

This makes OpenCL a very attractive option for developers who want to write code that can run on a wide range of devices. From Gnuastro’s perspective, this means that we can write code that can run on GPUs from multiple manufacturers, as well as on CPUs and other accelerators. Our GPU kernels will be portable to any system, regardless of its configuration!

Next point to consider is OpenCL is a standard. It is different from CUDA in this regard. CUDA is a framework, whereas OpenCL is a standard. What does this mean?

The OpenCL standard refers to the specification and guidelines set forth by the Khronos Group which is responsible for developing and maintaining the standard. The OpenCL standard defines the API, data types, functions, and programming model that developers must follow when writing code for OpenCL. It is a formal document that ensures uniformity and compatibility across different OpenCL implementations.

OpenCL is not an open-source library! It basically defines how the library should behave(big simplification!).

So what can we do with the standard alone? Not much! We need an implementation of the standard.

This also reminds me of the question I once had - What do you need to create a new programming language? My first guess was a compiler! My thought process was: if a program (a compiler in this case) can understand my high-level language and convert it to the corresponding machine code, then I can write programs in that high-level language for any task! So all I’d need is a compiler for that language. It’s close, but not totally accurate.

You don’t actually need a compiler for a new programming language. You ONLY need a specification for it. The specification will define the syntax and semantics (rules) of the language. You only need a compiler when you want to run programs using your language! (What good is a language if you can’t run programs using it? haha)

Similarly, OpenCL defines a set of rules which specify how it will behave. But to use OpenCL we need an implementation of this standard.

OpenCL implementations are software packages developed by hardware manufacturers that provide the necessary drivers and runtime libraries for running OpenCL applications on their specific hardware. Each hardware vendor is responsible for creating their own OpenCL implementation that conforms to the OpenCL standard. This means that each implementation may have its own unique features and quirks, but they all adhere to the same standard.

There are many different implementations available for it! (find the full list here or here).

Basically, each of the hardware manufacturers provides an implementation of the OpenCL standard for their hardware. This implementation is usually provided as a framework. Depending on what hardware you have on your system, you can choose the corresponding framework to use.

How does OpenCL work?

Here’s what a typical OpenCL system looks like :

OpenCL programs consist of two parts: host code and device code. The host code is written in C or C++ and runs on the host, while the device code is written in OpenCL C and runs on the device. The host code is responsible for setting up the OpenCL environment, creating the context, compiling the device code, and executing the kernels on the device.

The device code is compiled at runtime by the host code. This means that the host code must be compiled first, and then the device code can be compiled. The host code is compiled using a standard C/C++ compiler, while the device code is compiled using the OpenCL compiler. The OpenCL compiler is provided by the OpenCL implementation and is responsible for compiling the device code into binary code that can be executed on the device.

How does the OpenCL library interact with the hardware? It’s made possible through OpenCL-ICD.

OpenCL ICD stands for OpenCL Installable Client Driver. It is a component of the OpenCL

It enables multiple manufacturers’ OpenCL drivers to coexist on a single system. Instead of having a single monolithic OpenCL driver, an ICD allows different manufacturers (e.g., NVIDIA, AMD, Intel) to provide their own separate OpenCL implementations as dynamically loadable libraries. This means that developers can select the appropriate OpenCL driver at runtime without needing to modify their applications.

The ICD mechanism is crucial for achieving portability and flexibility in developing applications using computational power of various devices from different manufacturers.

OpenCL Programming Model

The Programming Model of OpenCL is very similar to CUDA, which I covered in my previous post. However, CUDA has a lot of abstraction since it has its own runtime library which communicates with the driver. In OpenCL there’s direct communication with the drivers and the host code is responsible for setting up the environment, so it’s a bit lower level than CUDA.

Some of the key terms in OpenCL are :

Work Item: Basic unit of work on a compute device
Kernel: The code that runs on a work item (Basically a C function)
Program: Collection of kernels and other functions
Context: The environment where work-items execute (Devices, their memories and command queues)
Command Queue: Queue used by the host to submit work (kernels, memory copies) to the device.

I’ll cover the programming aspect of OpenCL in more detail in my next post.

GPUs and Convolutions in Gnuastro

Labib Asari — Mon, 03 Jul 2023 23:00:00 GMT

Background

This is an overview of what I’ve been up to for the past 2 weeks. It doesn’t go into much technical detail or the actual code, but just walks through the general idea.

Convolution is a fundamental operation in various domains, such as image processing, signal processing, and deep learning. It is an important module in Gnuastro and is also used as a subroutine in other modules.

Convolutional operations can be broken down into smaller tasks, such as applying the kernel to different portions of the input data. By utilizing multiple threads, each thread can independently process a subset of the input, reducing the overall execution time. This parallelization technique is particularly effective when dealing with large input tensors or performing multiple convolutions simultaneously.
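To see why the work splits so cleanly, here is a minimal pure-Python sketch of a direct 2D convolution in which every output pixel is computed by an independent function call. It is not Gnuastro’s implementation: it assumes an odd-sized kernel and treats pixels outside the image as zero, and the function names are chosen only for this example.

```python
def convolve_pixel(image, kernel, row, col):
    """Value of one output pixel; pixels outside the image count as zero."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    r_off, c_off = k_rows // 2, k_cols // 2      # kernel centre (odd sizes)
    total = 0.0
    for i in range(k_rows):
        for j in range(k_cols):
            r, c = row + i - r_off, col + j - c_off
            if 0 <= r < len(image) and 0 <= c < len(image[0]):
                # The kernel is flipped, as in the textbook definition.
                total += image[r][c] * kernel[k_rows - 1 - i][k_cols - 1 - j]
    return total


def convolve(image, kernel):
    # Every (row, col) is independent, so each could go to its own thread.
    return [[convolve_pixel(image, kernel, r, c)
             for c in range(len(image[0]))]
            for r in range(len(image))]


image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
box = [[1.0 / 9.0] * 3 for _ in range(3)]        # 3x3 mean filter
smoothed = convolve(image, box)
print(round(smoothed[1][1], 6))
```

Since convolve_pixel only reads the input and writes a single output value, no two calls interfere with each other, which is exactly the property a GPU exploits.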

While traditional CPUs (Central Processing Units) excel at performing a wide range of tasks, they are not specifically designed for heavy parallel computations like convolutions. On the other hand, GPUs (Graphics Processing Units) are highly optimized for parallel processing, making them ideal for accelerating convolutional operations.

GPUs vs CPUs Architecture

Cores and Parallelism :

CPUs have fewer, more powerful cores optimized for sequential processing, while GPUs have thousands of smaller cores designed for parallel processing. This parallelism allows GPUs to perform computations on multiple data elements simultaneously, leading to significant speedup in parallelizable tasks like graphics rendering and deep learning.

Memory Hierarchy :

CPUs typically have larger caches and more advanced memory management units (MMUs), focusing on low-latency operations and complex branch prediction. GPUs prioritize high memory bandwidth and utilize smaller caches to efficiently handle large amounts of data simultaneously, which is crucial for tasks like image processing and scientific simulations.

Emphasis :

CPUs are designed with an emphasis on executing single threads very fast. GPUs are designed with an emphasis on executing many threads at once.

Programming Model

For Programming GPUs, several frameworks (high level APIs) are available

CUDA - developed by NVIDIA for its GPUs.
OpenCL - open standard for cross-platform parallel programming of diverse accelerators.
HIP - developed by AMD, portable.
and many more….

CUDA

The CUDA platform consists of a programming language, a compiler, and a runtime library.

Programming Language - Based on C, has extensions to write code for GPU.
Compiler - nvcc, built on LLVM; it hands the host code to the system compiler and translates device code into binary code that can be executed on the GPU.
Runtime Library - Provides the necessary functions and tools to manage the execution of the code on the GPU (interacts with the driver).

Note : When we have multiple devices(GPUs, FPGAs, etc) on a single system, which can execute tasks apart from the main CPU, they’re generally referred to as device whereas the main CPU is referred to as host.

CUDA Programs

CUDA programs consist of normal host code along with some kernels. Kernels are like other functions, but when you call a kernel, it is executed N times in parallel by N different CUDA threads, as opposed to only once like a normal function. They’re defined using the __global__ keyword.

Eg :

Normally, on a CPU, we would put such an element-wise addition inside a loop, so all elements are covered.

With GPUs, there’s no need for loops - for N elements, we launch N threads, each of which adds 1 element at the same time!
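The difference between the two styles can be simulated in plain Python (this is not CUDA, just a sequential stand-in). The sketch assumes the usual one-dimensional indexing convention, where a thread’s global index is its block index times the block size plus its index inside the block; all function names are made up for illustration.

```python
def add_loop(a, b):
    """CPU style: one thread walks over every element in turn."""
    out = [0.0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out


def add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    """Body one GPU thread would run: it handles exactly one element."""
    i = block_idx * block_dim + thread_idx    # global index of this thread
    if i < len(a):                            # threads past the end do nothing
        out[i] = a[i] + b[i]


def launch(n_blocks, block_dim, a, b):
    """Sequential stand-in for a launch of n_blocks * block_dim threads."""
    out = [0.0] * len(a)
    for block_idx in range(n_blocks):
        for thread_idx in range(block_dim):
            add_kernel(block_idx, block_dim, thread_idx, a, b, out)
    return out


a = [float(i) for i in range(10)]
b = [10.0 * i for i in range(10)]
result = launch(n_blocks=3, block_dim=4, a=a, b=b)   # 12 threads, 10 elements
print(result == add_loop(a, b))
```

On a real GPU the two nested loops in launch disappear: the hardware runs the kernel body for all indices concurrently, and the bounds check handles the two surplus threads.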

CUDA Execution Configuration

Can we launch an arbitrarily large number of threads? Technically, no.

The maximum number of threads depends on your GPU’s compute capability.
But generally it’s so large that it covers all your elements.
For Compute Capability 3.0 and above
- Max number of threads : (2^31)(2^16)(2^16)(2^10) = 2^73 (approximately)
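The arithmetic behind that limit can be checked with a few lines of Python. The values used are the grid and block limits that the CUDA programming guide lists for compute capability 3.0 and above (each limit is one less than the power of two, or exactly 1024 for the threads in a block); the variable names are only illustrative.

```python
# Maximum number of blocks per grid in x, y and z.
MAX_GRID_DIM = (2**31 - 1, 65535, 65535)
# Maximum number of threads in a single block.
MAX_THREADS_PER_BLOCK = 1024

max_blocks = MAX_GRID_DIM[0] * MAX_GRID_DIM[1] * MAX_GRID_DIM[2]
max_threads = max_blocks * MAX_THREADS_PER_BLOCK

# bit_length() tells how many bits are needed to write the number down.
print(max_threads)
print(max_threads.bit_length())
```

The result needs 73 bits, meaning it sits just under 2^73, far more than the number of pixels in any realistic image.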

Threads and Blocks :

All threads are organized into groups called blocks.
All blocks are organized into a group called the grid.

Blocks and grids can be 1D, 2D or 3D structures.

When calling a GPU kernel, we specify the structure of the grid (number of blocks) and the structure of each block (number of threads per block) - this is called the Execution Configuration.

Example : a kernel launched with a 32 x 32 x 1 grid of 16 x 16 blocks.

Such a launch creates 32 x 32 x 1 = 1024 blocks, each having 16 x 16 = 256 threads. Total no. of threads = 1024 x 256 = 262144.
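The same numbers can be reproduced in Python, together with the usual way a thread finds its global position from its block and thread indices. This is a CPU-side sketch, not CUDA code; grid_dim, block_dim and global_xy are names chosen here to mirror CUDA's gridDim, blockDim and the blockIdx * blockDim + threadIdx formula.

```python
from math import prod

grid_dim = (32, 32, 1)    # blocks per grid in x, y, z
block_dim = (16, 16, 1)   # threads per block in x, y, z

n_blocks = prod(grid_dim)
threads_per_block = prod(block_dim)
total_threads = n_blocks * threads_per_block


def global_xy(block_idx, thread_idx):
    """Global (x, y) position of a thread: block index * block size + thread index."""
    x = block_idx[0] * block_dim[0] + thread_idx[0]
    y = block_idx[1] * block_dim[1] + thread_idx[1]
    return x, y


print(n_blocks, threads_per_block, total_threads)
# The last thread of the last block lands on the last cell of a 512 x 512 area.
print(global_xy((31, 31, 0), (15, 15, 0)))
```

With this configuration the threads cover a 512 x 512 area, one thread per cell.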

CUDA Memory Hierarchy

CUDA threads may access data from multiple memory spaces during their execution as illustrated above.

Local memory, private to each thread.
Shared memory, visible to all threads of the same block.
Global memory, visible to all threads of all blocks.

CUDA Hardware abstraction

The entire GPU is divided into several Streaming Multiprocessors (SMs). They have a different architecture than a typical CPU core. Each SM has several CUDA cores, which are the actual processing units.

SMs are designed with a SIMT/SIMD philosophy, which allows multiple threads to execute concurrently on them. Each block is executed on a single SM; it is never split across SMs.

CUDA Developing Workflow

Results of Convolution on GPU for Gnuastro

All tests were performed on a system with the following specifications:

CPU :

Intel(R) Core(TM) i5-9300HF CPU @ 2.40GHz
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
CPU max MHz: 4100.0000
CPU min MHz: 800.0000

GPU :

NVIDIA GeForce GTX 1650
Turing Architecture
Driver Version: 535.54.03
CUDA Version: 12.2
VRAM : 4GB
Compute Capability : 7.5

The input image was a 10k x 20k FITS file with 32-bit floating point values. The kernel was a 3x3 matrix with 32-bit floating point values.

CPU Multi-threaded

GPU

The overall speedup seems to be only 6X, but this also counts the time taken to transfer the data from CPU to GPU and back. If we only consider the time taken to perform the convolution, the speedup is around 700X!

Creating a new Data Structure for pyGnuastro

Labib Asari — Mon, 19 Jun 2023 23:00:00 GMT

Background

Gnuastro is a powerful and comprehensive library designed to handle various data formats (FITS/TIFF/TXT and more) and perform a wide range of operations, all while maintaining consistency across its entire codebase.

This is done by representing all the data (acquired via input or created internally), regardless of its type, in a single data structure which encompasses the core data as well as metadata. This greatly assists in maintaining uniformity. Internally all the data is represented in the form of a C struct : gal_data_t. The following image describes how it keeps the core data as well as metadata :

Explaining each attribute of this structure will require a separate post of its own :). Instead I’ll focus on the main topic here : since I’m creating a Python package for Gnuastro, and gal_data_t is at the heart of this library, how do I represent this complex type in Python?!

Normally we use Classes to define new and complex data types in Python, but hey.. I’m wrapping a C library in Python using the Python-C API. This means I write my wrappers in C!

So the question comes down to: how do I create a new type in Python using the C language?

Creating New Data Types in Python Without Classes and Objects

Before I continue, I have to appreciate NumPy for the incredible piece of software it is. The more I understand it, the more it amazes me.

C is not an Object Oriented Programming Language, but Python is.

In case you didn’t know, the most common implementation of Python (the one you most probably have) is written in C! It’s called CPython.

This raises an obvious question, how does Python implement its whole OOP paradigm in C?

This question also answers our question of how to represent gal_data_t in Python, because essentially they’re looking for the same thing.

PyObject is the answer! To the Python interpreter (written in C), all the data types (built-in as well as user-defined) are of this type!

And what is this PyObject? It’s a simple struct in C.
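In CPython, that struct carries two things for every object: a reference count and a pointer to the object's type. Both can be observed from the Python side without writing any C, as in the small sketch below. The Point class is a hypothetical user-defined type added only for comparison, and sys.getrefcount is a CPython-specific helper whose absolute value should not be relied on; only the difference matters here.

```python
import sys


class Point:
    """A user-defined type, used here only for comparison with built-ins."""


values = [42, 3.14, "text", [1, 2], Point()]

# Every value, built-in or user-defined, is an object that knows its type.
for value in values:
    print(type(value).__name__, isinstance(value, object))

# Binding one more name to an object raises its reference count by one.
data = Point()
before = sys.getrefcount(data)
alias = data
after = sys.getrefcount(data)
print(after - before)
```

A new type written in C follows the same pattern: its struct starts with the PyObject header, so the interpreter can treat it like any other object.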

GSoC - it’s finally here

Labib Asari — Sun, 04 Jun 2023 23:00:00 GMT

What is Open-Source and GSoC?

Open source software is software with source code that anyone can inspect, modify, and enhance. There are many institutions and individuals who write open software, mainly for research or free deployment purposes. Most of this software has only a few maintainers, so having more people writing and debugging the code helps a lot. This is where Google Summer of Code (GSoC) comes into the picture. It is a global, online program focused on bringing new contributors into open source software development. Many organisations float projects for the developers to take up over the summer, and Google mediates in the process while also paying the contributors for their work over the summer.

What is my project about?

It has 2 main components :

Create a Python Library for Gnuastro
- Design an error handling mechanism for Gnuastro
- Design corresponding data structures of Gnuastro in Python
- Write wrapper functions to be used in Python
Add CUDA support in Gnuastro
- Integrate CUDA with Gnuastro’s build system
- Write GPU kernels for compute heavy and parallelizable operations.

What have I completed till now?

On the Python Library Part :
- Gnuastro now has an error handling mechanism!
- Added error handling in Python package for the 2 existing modules.
- Defined error types for each corresponding error type in C library.
- Implemented Python wrappers for 2 of the C library modules
On the CUDA support part :
- Gnuastro can now build with CUDA! This means it already supports GPU computations.
- Added docs for installing, configuring, and testing CUDA
- Added test CUDA kernels and demo programs to test them.
- Implementing CUDA kernel for Convolution operation.

GSoC - Pre Community Bonding

Labib Asari — Sat, 06 May 2023 23:00:00 GMT

What is Open-Source and GSoC?

My GSoC Journey - Part 4

Jash Shah — Tue, 19 Jul 2022 23:00:00 GMT

Writing the extension modules and Python wrappers for a package is one thing, but a step that is often overlooked is making a build system that complies with the rest of your program, ensures the correct installation based on your dependencies and also is portable enough to be distributable.

I learned these things the hard way in Week 3 and 4, where I went as low level as I could to try to solve all the weird build errors and glitches I had while trying to build a Python Package using GNU Autotools.

Building

As discussed in my last GSoC blog, I was mainly using distutils along with its distutils.setup script to take care of all the building and linking required for building the .so (shared object) file required by the Python interpreter. However, one of my co-mentors brought up a good point: setuptools is the packaging tool recommended by PyPA, and wheels are the recommended way to package the modules instead of the standard setup.py build command.

Hence, Week 3 was spent mainly learning about setuptools and wheels. What Are Python Wheels and Why Should You Care? is a great article to start with Python Wheels. The setuptools documentation is a great place to know about setuptools, if you already know about distutils like me! Luckily, while Setuptools is a “beefier” version of distutils, as it offers better and more packaging utilities, it keeps the same functions, so in terms of code it was just a change of one line for me.

Originally, with distutils, the plan was to have the files related to the Python Package in a separate python/ directory at the root of the Gnuastro source like:

📦python
┣ 📂gnuastro.arithmetic
┃ ┣ 📜arithmetic.c
┃ ┗ 🔧setup.py
┣ 📂gnuastro.cosmology
┃ ┣ 📜cosmology.c
┃ ┗ 🔧setup.py
┣ 📂gnuastro.fits
┃ ┣ 📜fits.c
┃ ┗ 🔧setup.py
┗ 📑Makefile.am

The idea was to have the setup.py script in each folder build that specific extension, and let the Makefile handle the linking. But I soon realized that this was excessive. A better structure would be:

📦python
┣ 📂src
┃ ┣ 📜arithmetic.c
┃ ┣ 📜cosmology.c
┃ ┗ 📜fits.c
┣ 📑Makefile.am
┗ 🔧setup.py
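With this layout, a single setup script can describe all the extensions at once. The following is a minimal sketch of what such a script could look like, not the actual setup.py of the project: the gnuastro.* import names, the package name, and linking against a library called gnuastro are assumptions made for illustration. setuptools is imported lazily inside the hypothetical run_setup so the rest of the sketch runs without it.

```python
MODULES = ("arithmetic", "cosmology", "fits")


def extension_specs(src_dir="src"):
    """Describe one extension per C file in src/ as keyword arguments."""
    return [
        {
            "name": f"gnuastro.{module}",          # assumed import name
            "sources": [f"{src_dir}/{module}.c"],
            "libraries": ["gnuastro"],             # assumed: link to libgnuastro
        }
        for module in MODULES
    ]


def run_setup():
    """Hand all extensions to a single setup() call, as setup.py would."""
    from setuptools import Extension, setup  # needs setuptools installed

    setup(
        name="gnuastro",                           # assumed package name
        ext_modules=[Extension(**spec) for spec in extension_specs()],
    )


for spec in extension_specs():
    print(spec["name"], spec["sources"])
```

In a real setup.py, run_setup() would be called at the bottom of the file, and adding a new module would only require a new entry in MODULES.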

Using Autotools to build Python Package

As the name suggests, Gnuastro is a GNU project, and thus depends on Autotools (Automake, Autoconf and Libtool) for its building and compiling. These are the tools behind the

./configure
make
make check
make install

set of instructions.

Along with the setup script, I also added a new file (python.c) to the lib/ directory of Gnuastro. This file basically provides any utility functions I might require while building the Python package. Currently, the file provides type conversion functions, which facilitate converting between Gnuastro’s and NumPy’s datatypes.
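The kind of lookup such conversion functions perform can be pictured as a two-way table. The sketch below is only an illustration in Python, using plain strings: the actual functions in lib/python.c are written in C and are not reproduced here. The keys follow Gnuastro's GAL_TYPE_* naming for its numeric types, the values are NumPy dtype names, and to_numpy_dtype is a hypothetical helper.

```python
# Gnuastro type names mapped to NumPy dtype names (both as plain strings).
GAL_TO_NUMPY = {
    "GAL_TYPE_UINT8": "uint8",
    "GAL_TYPE_INT8": "int8",
    "GAL_TYPE_UINT16": "uint16",
    "GAL_TYPE_INT16": "int16",
    "GAL_TYPE_UINT32": "uint32",
    "GAL_TYPE_INT32": "int32",
    "GAL_TYPE_UINT64": "uint64",
    "GAL_TYPE_INT64": "int64",
    "GAL_TYPE_FLOAT32": "float32",
    "GAL_TYPE_FLOAT64": "float64",
}

# The reverse direction is derived from the same table to keep both in sync.
NUMPY_TO_GAL = {dtype: gal for gal, dtype in GAL_TO_NUMPY.items()}


def to_numpy_dtype(gal_type):
    """Return the NumPy dtype name for a Gnuastro type name."""
    try:
        return GAL_TO_NUMPY[gal_type]
    except KeyError:
        raise ValueError(f"no NumPy equivalent for {gal_type!r}") from None


print(to_numpy_dtype("GAL_TYPE_FLOAT32"))
print(NUMPY_TO_GAL["int64"])
```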

So, what is the difference between your traditional Makefile and using Autotools instead?

Autoconf easily scans an existing tree to find its dependencies and creates a configure script that will run under almost any kind of shell. The configure script allows the user to control the build behavior (e.g. --with-foo, --without-python, --prefix, --sysconfdir, etc.) as well as doing checks to ensure that the system can compile the program.

Configure generates a config.h file (from a template) which programs can include to work around portability issues. For example, if HAVE_NUMPY is not defined, don’t build the Python package.

Automake provides a short template that describes what programs will be built and what objects need to be linked to build them, thus Makefiles that adhere to GNU coding standards can automatically be created.

My job was to use these tools to also call the setup script for building my Python package.

My approach to building the package using Autotools involved 4 basic steps:

Adding the necessary checks in the configure.ac script.
- Check if a user has Python 3 on their system and get its include path, i.e. the path to the Python.h file.
- If Python 3 is found, check if the user has NumPy on their system and get its include path.
- Substitute the include paths as variables to be passed to all Makefile.am's.
Conditionally build the Python package and its utility functions module(lib/python.c) only if the above checks are passed.
Write the Makefile.am in the python/ directory which would handle the build, install, uninstall and clean targets for the Python package.
Re-write the setup.py script to make it more generic, by using the environment variables passed by the configure script instead of hardcoding the include and install paths.
- This also ensures that the Python package building supports VPATH builds, which is another great feature of Autotools. For the uninitiated, VPATH builds are basically a way to separate your source and build trees, so that all the built files (.o, .so, etc.) end up in a different directory from your source files, while make still finds the sources in the source tree.
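The first and last steps above can be sketched from the Python side as follows. sysconfig.get_paths() and numpy.get_include() are the standard ways to ask Python and NumPy for their header directories; the environment variable names PYTHON_INCLUDE_DIR and NUMPY_INCLUDE_DIR are hypothetical stand-ins for whatever the configure script exports, and the helper names are likewise assumptions.

```python
import os
import sysconfig


def python_include_dir():
    """Directory that holds Python.h for the running interpreter."""
    return sysconfig.get_paths()["include"]


def numpy_include_dir():
    """Directory that holds the NumPy headers, or None without NumPy."""
    try:
        import numpy
    except ImportError:
        return None
    return numpy.get_include()


def include_dirs(environ=None):
    """Prefer paths passed in by the build system, else discover them."""
    environ = os.environ if environ is None else environ
    # Hypothetical variable names; the real ones are set by configure.
    py_dir = environ.get("PYTHON_INCLUDE_DIR") or python_include_dir()
    np_dir = environ.get("NUMPY_INCLUDE_DIR") or numpy_include_dir()
    return [path for path in (py_dir, np_dir) if path]


print(include_dirs({"PYTHON_INCLUDE_DIR": "/opt/py/include",
                    "NUMPY_INCLUDE_DIR": "/opt/np/include"}))
```

Because the setup script reads the paths instead of hardcoding them, the same script works no matter where the build directory or the Python installation lives.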

This process took a lot of trial and error, digging into the Autotools (mostly Automake) documentation and playing around with the Makefile.am to get it right. But it introduced me to these amazing tools and taught me how to make any scrawny personal project distributable!

Installing

After running,

python3 setup.py build_ext bdist_wheel

the distributable wheel file, with all of the package’s metadata, is created under the dist/ folder. In order to install this file we use pip as follows:

pip install Gnuastro.whl

YES! It is in fact as simple as that!

But there is an issue that I faced here. Suppose that a user wants to install the Gnuastro library in their root directory, or in any directory where they don’t have privileges. This means they’ll run sudo make install from the root of the source. This cascades to calling the Makefile in the python/ directory with root access as well. However, running pip with sudo access is a big NO-NO. And pip would warn you of that with a warning like:

This is because Python packages are generally installed at a local level, in the /usr/local directory. However, if you call pip with sudo then it installs the packages in the root directory. To solve this, we use

sudo -u "$SUDO_USER" pip install Gnuastro.whl

which basically runs the pip command as the user who called sudo. This will ensure that your package gets installed in the local directory instead of root!
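The same decision can be written as a small Python helper, which makes the logic easy to test. SUDO_USER is the environment variable that sudo sets to the name of the invoking user; the function name pip_install_command is hypothetical, and the sketch only builds the command without running it.

```python
import os


def pip_install_command(wheel, environ=None):
    """Build the pip command, dropping back to the invoking user under sudo."""
    environ = os.environ if environ is None else environ
    command = ["pip", "install", wheel]
    sudo_user = environ.get("SUDO_USER")
    if sudo_user:
        # Run pip as the user who called sudo instead of as root.
        command = ["sudo", "-u", sudo_user] + command
    return command


print(pip_install_command("Gnuastro.whl", {"SUDO_USER": "alice"}))
print(pip_install_command("Gnuastro.whl", {}))
```

When SUDO_USER is not set, the command is left untouched, so a normal (non-sudo) install behaves exactly as before.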

My GSoC Journey - Part 3

Jash Shah — Thu, 30 Jun 2022 23:00:00 GMT

Coding Begins!

So, now that I got to know the Gnuastro community a bit and had discussed the plan of attack with my mentor it was time to start with the actual coding.

Week 1

As planned, I started with building the extension module for the Cosmic Calculator (cosmiccal) library. A simple Python extension module should be structured as:

The cosmiccal library was chosen as a starting point because it contained only 6 functions and solely dealt with doubles, ints, and floats. Consequently, there wasn’t yet a requirement for a NumPy converter. It was pretty straightforward to create wrappers for these functions by following the aforementioned structure. The setup.py script for building and installing these modules was created at the following stage. For this, I followed the Python Extension documentation’s advice and utilised distutils, which offers two crucial functions:

distutils.core.Extension which is used to describe a C/C++ extension.
distutils.core.setup the frontman in actually building and compiling the modules.

After this, the commands to build and install these modules were simply:

python3 setup.py build
python3 setup.py install

Week 2

At our subsequent meeting, my mentor confirmed my work, and we both agreed that the next step should be to write the NumPy converter so that this may be expanded to include the other library modules as well.

Week 2 was a little light on work because I was out of town for a few days. However, most of my reading time was devoted to learning about the NumPy C-API and how it connects with the Python C-API.

I discovered that a NumPy array’s primary container object was the PyArrayObject, and its PyTypeObject was the PyArray_Type. Therefore, in order for any PyObject to be regarded as a NumPy Array, it has to fulfil these two requirements.

The API itself offered functions that allowed any generic array type data container to be converted into PyArray_Type or one of its subclasses. For creating the converter, I would always turn to these!

My GSoC Journey - Part 2

Jash Shah — Sat, 25 Jun 2022 23:00:00 GMT

Community Bonding Period

The official GSoC docs describe the Community Bonding Period as "The first phase in which you get to know your community and get familiar with their code base and work style." Following this definition, my main goal in these few weeks (20th May - 13th June) was to learn the inner workings of Gnuastro and to understand the communication within it, by becoming an active part of the community.

Week 1

As I had my end semester exams scheduled from 9th May till 27th May, it was tough for me to be as active as I would have liked, so this week was mostly spent observing the communication between the different members of the community, mainly through our Element (Matrix) channel. I would also browse through the Savannah pages of Gnuastro to see the bugs/features currently in work, while also noting the contributions currently being made, by reading through the great commit messages.

Week 2/3 - The First Meet with Mentor !!

I dedicated myself to learning about Python Extensions (whose docs are really succinct!) and testing them out by creating some sample modules during this period. It was quite insightful, as it gave me a better perspective on what the final outcome of my project will be and what flow of work I would have to follow to achieve that.

I was also pretty excited about finally getting to meet my mentor, Mohammad Akhlaghi, who I had only talked to over mail and IRC, but who had been incredibly kind and welcoming. Taking the schedules of all other developers into consideration as well, we decided on Tuesdays, 1:00 PM CEST (4:30 PM IST), as our weekly meeting time. The first meet went great, and it was really fun (albeit overwhelming) to meet all the other developers and get an insight into the awesome work everyone’s doing at Gnuastro! I presented a few slides I had prepared to give an overview of my project to the other developers. Along with Mohammad, we were also able to decide on a few goals for the next meet:

Learn more about Python Extension Modules and try to extend a simple Gnuastro library module (cosmiccal in particular) into Python.
Research the NumPy C-API and how we can go about building a converter between Gnuastro's core data structure and NumPy’s PyArrayObject.

Conclusion

Going by the definition of Community Bonding given by GSoC, I think I managed to get a good insight into the Gnuastro community by observing their communications and contributions, while also making myself familiar with their work style!