<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Universe OpenAstronomy (Posts about gnuastro)</title><link>http://openastronomy.org/Universe_OA/</link><description></description><atom:link href="http://openastronomy.org/Universe_OA/categories/gnuastro.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><lastBuildDate>Wed, 31 Dec 2025 02:08:40 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Final GSoC Report</title><link>http://openastronomy.org/Universe_OA/posts/2024/10/20241006_2230_deadspheroid/</link><dc:creator>DeadSpheroid</dc:creator><description>&lt;p class="intro"&gt;In this post, I'll be discussing my GSoC'24 project, the goals set, work done, and future scope&lt;/p&gt;

&lt;h2 id="about-openastronomy"&gt;About OpenAstronomy&lt;/h2&gt;
&lt;!-- TEASER_END --&gt;
&lt;p align="center" width="100%"&gt;
&lt;img alt="OpenAstronomy Logo" src="https://deadspheroid.github.io/my-blog/assets/img/logoOA_svg.png" style="margin-bottom: 0; margin-top: 24px;"&gt;
&lt;/p&gt;
&lt;p&gt;OpenAstronomy is a collaboration between open source astronomy and astrophysics projects to share resources, ideas, and to improve code.&lt;/p&gt;

&lt;p&gt;OpenAstronomy consists of many different projects, including astropy, sunpy, stingray, radis and, of course, GNU Astronomy Utilities (Gnuastro).&lt;/p&gt;

&lt;h2 id="about-gnuastro"&gt;About Gnuastro&lt;/h2&gt;
&lt;p align="center" width="100%"&gt;
&lt;img alt="Gnuastro Logo" src="https://deadspheroid.github.io/my-blog/assets/img/gnu-logo.png" style="margin-bottom: 0; margin-top: 24px;"&gt;
&lt;/p&gt;

&lt;p&gt;Gnuastro is an official GNU package that consists of many CLI programs as well as library functions for manipulation and analysis of astronomical data.
Something important to note about Gnuastro is that it is written entirely in C99 and shell script.
The Gnuastro team meets every Tuesday to exchange notes and review progress.&lt;/p&gt;

&lt;h2 id="goals-of-the-project"&gt;Goals of the project&lt;/h2&gt;

&lt;p&gt;Going into this project, two main goals were set:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Setting up the low level wrapper infrastructure for using OpenCL within Gnuastro, putting minimal requirements on the developers/users to know OpenCL.&lt;/li&gt;
&lt;li&gt;Parallelizing Gnuastro subroutines using the aforementioned wrappers to offload compute-heavy tasks to the GPU.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first goal deals with how users of the library would interact with the OpenCL modules. Ideally, users should need no knowledge of OpenCL and should only interact with it through Gnuastro.&lt;/p&gt;

&lt;p&gt;The second goal deals with analyzing parts of the Gnuastro library, identifying easy-to-parallelize sections and writing optimised OpenCL kernels for them, leveraging the wrapper infrastructure for execution.&lt;/p&gt;

&lt;p&gt;The majority of my work lives &lt;a href="https://github.com/DeadSpheroid/gnuastro/tree/final"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="pre-gsoc"&gt;Pre GSoC&lt;/h2&gt;

&lt;p&gt;Prior to GSoC, my experience consisted mostly of Deep Learning and Natural Language Processing. My knowledge of GPU Processing was limited and naive.&lt;/p&gt;

&lt;p&gt;During the proposal drafting period, candidates were asked to submit a task: implementing simple image convolution on both the CPU and the GPU using OpenCL.&lt;/p&gt;

&lt;h2 id="during-gsoc"&gt;During GSoC&lt;/h2&gt;

&lt;p&gt;Work on GSoC kicked off around May 1st with the first objective being&lt;/p&gt;

&lt;h3 id="build-system-integration"&gt;Build System Integration&lt;/h3&gt;

&lt;p&gt;Now, Gnuastro, being a GNU project, uses the GNU Build System, i.e. GNU Autotools (Autoconf, Automake, Libtool).&lt;/p&gt;

&lt;p&gt;This was a completely new build system for me to work with, and I had to get the library to include OpenCL and link against the OpenCL library at compile time.&lt;/p&gt;

&lt;p&gt;Thanks to some helpful pointers from Mohammad, I was able to grasp how it works fairly quickly and set up Gnuastro to include and build with OpenCL if it is detected on the system.&lt;/p&gt;

&lt;p&gt;For more information, you can read &lt;a href="https://deadspheroid.github.io/my-blog/post/GettingStarted/"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h3 id="wrapper-infrastructure"&gt;Wrapper Infrastructure&lt;/h3&gt;

&lt;p&gt;The next goal was to create wrappers around the OpenCL C API for various operations (data transfer, launching kernels, querying devices).&lt;/p&gt;

&lt;p&gt;This was done in the form of a new Gnuastro module called &lt;code class="language-plaintext highlighter-rouge"&gt;cl-utils.c&lt;/code&gt;, which contains:&lt;/p&gt;

&lt;h5 id="initialisation"&gt;Initialisation&lt;/h5&gt;
&lt;p&gt;Functions dealing with initialising OpenCL and creating and destroying OpenCL objects.&lt;/p&gt;

&lt;h5 id="data-transfer-functions"&gt;Data Transfer Functions&lt;/h5&gt;
&lt;p&gt;Functions dealing with data transfer to and from the GPU.
Data transfer is one of the biggest overheads in GPU programming. Additionally, OpenCL presented another challenge in the form of transferring structs to the GPU, which was problematic because one of Gnuastro’s most important data structures, &lt;code class="language-plaintext highlighter-rouge"&gt;gal_data_t&lt;/code&gt;, could not be directly transferred.&lt;/p&gt;
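
&lt;p&gt;A minimal sketch of why a struct with internal pointers can’t simply be copied over as raw bytes. The &lt;code class="language-plaintext highlighter-rouge"&gt;struct image&lt;/code&gt; and &lt;code class="language-plaintext highlighter-rouge"&gt;copy_shares_host_array()&lt;/code&gt; names are hypothetical stand-ins made up for illustration (this is not the real &lt;code class="language-plaintext highlighter-rouge"&gt;gal_data_t&lt;/code&gt;): a field-by-field copy duplicates the pointer value, which still points into CPU RAM, and not the pixels behind it.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-C"&gt;/* Hypothetical stand-in for a struct with an internal pointer,
   loosely modelled on gal_data_t (not the real definition). */
struct image
{
  float *array;   /* Points into host (CPU) memory. */
  int    size;    /* Number of elements in 'array'. */
};

/* Copying the struct field by field (as a plain byte transfer would)
   duplicates the pointer value, not the pixels it points to. Returns 1
   when the copy still refers to the original host array. */
int
copy_shares_host_array (struct image original)
{
  struct image copy = original;   /* Shallow copy of the fields. */
  return copy.array == original.array;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The function returns 1 for any input: the copy and the original share the same host array, and a device with its own separate memory would have no way to dereference that pointer.&lt;/p&gt;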

&lt;p&gt;So, this module provides two ways of transferring data to the GPU:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;OpenCL Buffers&lt;/li&gt;
&lt;li&gt;OpenCL SVM (Shared Virtual Memory)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Buffers are intended to be used for simple data structures, while OpenCL SVM is better with more complex data structures involving internal pointers.&lt;/p&gt;

&lt;p&gt;The interface looks something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-C"&gt;void
gal_cl_write_to_device (cl_mem *buffer, void *mapped_ptr,
cl_command_queue command_queue);

void *
gal_cl_read_to_host (cl_mem buffer, size_t size,
cl_command_queue command_queue);
.
.
.
gal_data_t *
gal_cl_alloc_svm (size_t size_of_array, size_t size_of_dsize,
cl_context context, cl_command_queue command_queue);

void
gal_cl_map_svm_to_cpu (cl_context context, cl_command_queue command_queue,
void *svm_ptr, size_t size);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For more information on the two, and a comparison see &lt;a href="https://deadspheroid.github.io/my-blog/post/ExploringFurther/"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h5 id="executing-kernels"&gt;Executing Kernels&lt;/h5&gt;
&lt;p&gt;Now, the main code running on the GPU is the OpenCL Kernel, usually defined in a .cl file and compiled at runtime.&lt;/p&gt;

&lt;p&gt;The idea when making this module was to keep the interface as similar as possible to the original pthreads &lt;code class="language-plaintext highlighter-rouge"&gt;gal_threads_spin_off()&lt;/code&gt; interface that Gnuastro already had. So I created a &lt;code class="language-plaintext highlighter-rouge"&gt;gal_cl_threads_spinoff()&lt;/code&gt; function, taking information like the kernel filepath, number of inputs, list of inputs, number of threads executed and more.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-C"&gt;typedef struct clprm
{
char               *kernel_path; /* Path to kernel.cl file */
char               *kernel_name; /* Name of __kernel function */
char             *compiler_opts; /* Additional compiler options */
cl_device_id          device_id; /* Device to be targeted */
cl_context              context; /* Context of OpenCL in use */
int             num_kernel_args; /* Number of total kernel arguments */
int                num_svm_args; /* Number of SVM args*/
void              **kernel_args; /* Array of pointers to kernel args */
size_t       *kernel_args_sizes; /* Sizes of non SVM args */
int          num_extra_svm_args; /* Number of implicit SVM args */
void           **extra_svm_args; /* Array of pointers to these args */
int                    work_dim; /* Work dimension of job - 1,2,3 */
size_t        *global_work_size; /* Array of global sizes of size work_dim */
size_t         *local_work_size; /* Array of local sizes of size work_dim */
} clprm;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;These wrappers were not developed all at once, but rather in conjunction with the next section, writing wrappers as and when I needed them.&lt;/p&gt;

&lt;h3 id="parallelized-subroutines"&gt;Parallelized Subroutines&lt;/h3&gt;
&lt;p&gt;To achieve the goal of GPU acceleration, first we needed to identify parts of the library that could be parallelized.
It’s important to note that not everything can be parallelized, and just because something can be doesn’t mean it should be.&lt;/p&gt;

&lt;p&gt;The most obvious candidate for this of course was 2D Image Convolution, already implemented in Gnuastro in the &lt;code class="language-plaintext highlighter-rouge"&gt;astconvolve&lt;/code&gt; module.&lt;/p&gt;

&lt;h5 id="same-code-on-cpu-and-gpu"&gt;Same code on CPU and GPU&lt;/h5&gt;
&lt;p&gt;The initial idea was to have the exact same code running on both the CPU (via pthreads) and the GPU (via OpenCL). This is possible because OpenCL kernels are written in OpenCL C, which is a variant (kind of a subset) of C99.&lt;/p&gt;

&lt;p&gt;The motivation is that Gnuastro is a “minimal dependencies” package, and having two separate implementations would greatly overcomplicate the codebase.&lt;/p&gt;

&lt;p&gt;However for the time being, this idea was shelved, till I had a working implementation of convolution in OpenCL.&lt;/p&gt;

&lt;h5 id="convolution"&gt;Convolution&lt;/h5&gt;
&lt;p&gt;I got to work creating a new module, &lt;code class="language-plaintext highlighter-rouge"&gt;cl-convolve.c&lt;/code&gt;, containing the new implementation of convolution, &lt;code class="language-plaintext highlighter-rouge"&gt;gal_convolve_cl()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The exact code can be viewed &lt;a href="https://github.com/DeadSpheroid/gnuastro/blob/final/lib/cl-convolve.c"&gt;here&lt;/a&gt;, but in short:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Transfer input, kernel and output images to GPU&lt;/li&gt;
&lt;li&gt;Spin off a thread for each pixel in the input, convolving that particular pixel.&lt;/li&gt;
&lt;li&gt;Copy the output image back to CPU&lt;/li&gt;
&lt;/ol&gt;

&lt;h5 id="additional-features"&gt;Additional features&lt;/h5&gt;
&lt;p&gt;However, Gnuastro doesn’t use a &lt;strong&gt;simple 2D convolution&lt;/strong&gt;; it also performs three additional important tasks:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Edge Correction:&lt;/strong&gt; Pixels near the edge use a different kernel weight than others. More info &lt;a href="https://www.gnu.org/savannah-checkouts/gnu/gnuastro/manual/html_node/Edges-in-the-spatial-domain.html"&gt;here&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NaN Checking:&lt;/strong&gt; Images captured by astronomical cameras often have missing pixels (represented as NaNs). These pixels are to be ignored.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Channels:&lt;/strong&gt; Cameras use multiple different sensors to capture images, and convolution should not mix pixels from different sensors. For a better idea, read &lt;a href="https://www.gnu.org/savannah-checkouts/gnu/gnuastro/manual/html_node/Tessellation.html"&gt;Gnuastro’s explanation&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first two were rather easy to implement, but the third was a bit troublesome, especially because Gnuastro’s existing implementation was complex and hard to understand.&lt;/p&gt;

&lt;p&gt;Eventually however, with a little bit of math it was possible, and the final kernel looked like &lt;a href="https://github.com/DeadSpheroid/gnuastro/blob/4442a544db5d33d64290ac0b15a97bd627ad6335/bin/convolve/astconvolve-conv.cl"&gt;this&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With these parts completed, all that was left was to actually integrate it properly with Gnuastro.&lt;/p&gt;

&lt;h5 id="optimised-convolution"&gt;Optimised Convolution&lt;/h5&gt;
&lt;p&gt;The power of GPUs comes not from the many threads that are launched, but rather from the many optimisations possible, from organising threads into blocks, to special kinds of memory. I decided to try optimising Convolution based on Labeeb’s suggestion of using shared memory.&lt;/p&gt;

&lt;p&gt;However, most of the optimisation guides out there are for CUDA, not OpenCL, although the principles in question are the same. Thanks to &lt;a href="https://www.evl.uic.edu/sjames/cs525/final.html"&gt;this article&lt;/a&gt;, I was able to implement an optimised 2D convolution kernel in OpenCL.&lt;/p&gt;

&lt;p&gt;The results of the optimisation were surprisingly positive.
For a 5000 x 5000 image, the times recorded for the convolution operation (in seconds, excluding data reading/writing) were:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: left;"&gt; &lt;/th&gt;
&lt;th style="text-align: center;"&gt;Pthread&lt;/th&gt;
&lt;th style="text-align: center;"&gt;OpenCL-CPU&lt;/th&gt;
&lt;th style="text-align: center;"&gt;OpenCL-GPU&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;w/out optimisations&lt;/td&gt;
&lt;td style="text-align: center;"&gt;1.014374&lt;/td&gt;
&lt;td style="text-align: center;"&gt;0.918015&lt;/td&gt;
&lt;td style="text-align: center;"&gt;&lt;strong&gt;0.025869&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;w/ optimisations&lt;/td&gt;
&lt;td style="text-align: center;"&gt;1.053622&lt;/td&gt;
&lt;td style="text-align: center;"&gt;0.326756&lt;/td&gt;
&lt;td style="text-align: center;"&gt;&lt;strong&gt;0.004184&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;That’s a speedup of &lt;strong&gt;~6.2 times&lt;/strong&gt; over the non-optimised GPU run, and &lt;strong&gt;~242 times&lt;/strong&gt; over the existing pthread implementation in Gnuastro!&lt;/p&gt;
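
&lt;p&gt;For reference, these figures are simply ratios of the times in the table. A small sketch of the arithmetic (the &lt;code class="language-plaintext highlighter-rouge"&gt;speedup()&lt;/code&gt; helper is a hypothetical name, not part of Gnuastro):&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-C"&gt;/* Ratio of a baseline time to an accelerated time (both in seconds). */
double
speedup (double baseline_seconds, double accelerated_seconds)
{
  return baseline_seconds / accelerated_seconds;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With the table’s numbers, &lt;code class="language-plaintext highlighter-rouge"&gt;speedup (0.025869, 0.004184)&lt;/code&gt; is about 6.18, and &lt;code class="language-plaintext highlighter-rouge"&gt;speedup (1.014374, 0.004184)&lt;/code&gt; is about 242.4. The second figure compares the pthread time from the “w/out optimisations” row with the optimised GPU time.&lt;/p&gt;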

&lt;p&gt;Further optimisations are possible using special native functions like MUL24 and constant memory. But the details of those, and how these optimisations work, are a topic for a separate post.&lt;/p&gt;

&lt;h5 id="revisiting-same-code-on-cpu-vs-gpu"&gt;Revisiting Same Code on CPU vs GPU&lt;/h5&gt;
&lt;p&gt;After a discussion, it was decided that the best path forward for OpenCL in Gnuastro would be to completely replace the existing pthread implementation.&lt;/p&gt;

&lt;p&gt;In essence, the existing “convoluted” convolution implementation would be replaced with my new one, allowing the same code to be run in three different ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;With OpenCL on the GPU&lt;/li&gt;
&lt;li&gt;With OpenCL on the CPU&lt;/li&gt;
&lt;li&gt;With GCC+Pthreads on the CPU&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This decision was made to adhere to the Gnuastro philosophy of “Minimal Dependencies” so the user does not have to install many packages just to use the library.&lt;/p&gt;

&lt;p&gt;It was challenging, owing to the different styles in which we write code for a CPU device versus a GPU device. But I managed to get a partially working version using some C macros here and there to do so. It still fails some Gnuastro tests, which is yet to be resolved.&lt;/p&gt;

&lt;p&gt;However, doing so prevents the library from utilising the full power of GPUs with several GPU specific optimisations seen previously.&lt;/p&gt;

&lt;h5 id="using-the-opencl-modules-in-your-program"&gt;Using the OpenCL modules in your program&lt;/h5&gt;
&lt;p&gt;Finally, when a user wants to use Gnuastro’s OpenCL capabilities within their own programs, the flow followed would look like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Initialize OpenCL&lt;/li&gt;
&lt;li&gt;Transfer Input to Device&lt;/li&gt;
&lt;li&gt;Write an OpenCL Kernel&lt;/li&gt;
&lt;li&gt;Spinoff Threads&lt;/li&gt;
&lt;li&gt;Copy Output back to Host&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s take an example where we simply need to add two FITS images.&lt;/p&gt;

&lt;h6 id="initialize-opencl"&gt;Initialize OpenCL&lt;/h6&gt;
&lt;pre&gt;&lt;code class="language-C"&gt;cl_context context;
cl_platform_id platform_id;
cl_device_id device_id;

gal_cl_init (CL_DEVICE_TYPE_GPU, &amp;amp;context, &amp;amp;platform_id, &amp;amp;device_id);
cl_command_queue command_queue
  = gal_cl_create_command_queue (context, device_id);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This initializes an OpenCL context, among other objects, for use in future function calls.&lt;/p&gt;

&lt;h6 id="transfer-input-to-device"&gt;Transfer Input to Device&lt;/h6&gt;
&lt;p&gt;Make use of &lt;code class="language-plaintext highlighter-rouge"&gt;gal_cl_copy_data_to_gpu()&lt;/code&gt; to transfer the loaded FITS files to the GPU, passing the previously initialized context and command queue. Make sure the command queue finishes before proceeding, through &lt;code class="language-plaintext highlighter-rouge"&gt;gal_cl_finish_queue()&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-C"&gt;gal_data_t *input_image1_gpu
  = gal_cl_copy_data_to_gpu (context, command_queue, input_image1);
gal_data_t *input_image2_gpu
  = gal_cl_copy_data_to_gpu (context, command_queue, input_image2);
gal_data_t *output_image_gpu
  = gal_cl_copy_data_to_gpu (context, command_queue, output_image);

gal_cl_finish_queue (command_queue);
&lt;/code&gt;&lt;/pre&gt;

&lt;h6 id="write-an-opencl-kernel"&gt;Write an OpenCL Kernel&lt;/h6&gt;
&lt;p&gt;First, any custom structs you use must be defined in the kernel; here we define &lt;code class="language-plaintext highlighter-rouge"&gt;gal_data_t&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Then, you create the “per thread” function that will be executed, prefixed by &lt;code class="language-plaintext highlighter-rouge"&gt;__kernel&lt;/code&gt; and always returning &lt;code class="language-plaintext highlighter-rouge"&gt;void&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In the arguments, list the pointers to the inputs/outputs, each marked with the &lt;code class="language-plaintext highlighter-rouge"&gt;__global&lt;/code&gt; qualifier, since this data is accessible by all threads.&lt;/p&gt;

&lt;p&gt;Make use of OpenCL’s &lt;code class="language-plaintext highlighter-rouge"&gt;get_global_id(0)&lt;/code&gt; to get the thread id along the 0th dimension.&lt;/p&gt;

&lt;p&gt;Perform the core operation of your program.&lt;/p&gt;

&lt;p&gt;Putting it all together, it looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-C"&gt;typedef struct  __attribute__((aligned(4))) gal_data_t
{
/* Basic information on array of data. */
void *restrict array; /* Array keeping data elements.               */
uchar type;         /* Type of data (see 'gnuastro/type.h').      */
size_t ndim;          /* Number of dimensions in the array.         */
size_t *dsize;        /* Size of array along each dimension.        */
size_t size;          /* Total number of data-elements.             */
.
.
.
/* Pointers to other data structures. */
struct gal_data_t *next;  /* To use it as a linked list if necessary.   */
struct gal_data_t *block; /* 'gal_data_t' of hosting block, see above.  */
} gal_data_t;

__kernel void
add(__global gal_data_t *input_image1,
    __global gal_data_t *input_image2,
    __global gal_data_t *output_image)
{
  int id = get_global_id(0);

  float *input_array1 = (float *)input_image1-&amp;gt;array;
  float *input_array2 = (float *)input_image2-&amp;gt;array;
  float *output_array = (float *)output_image-&amp;gt;array;

  output_array[id] = input_array1[id] + input_array2[id];
  return;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;h6 id="spin-off-threads"&gt;Spin Off Threads&lt;/h6&gt;
&lt;p&gt;Make use of the &lt;code class="language-plaintext highlighter-rouge"&gt;clprm&lt;/code&gt; struct defined in &lt;code class="language-plaintext highlighter-rouge"&gt;gnuastro/cl-utils.h&lt;/code&gt; to group all the relevant parameters.&lt;/p&gt;

&lt;p&gt;&lt;code class="language-plaintext highlighter-rouge"&gt;Kernel Path&lt;/code&gt; is the filepath to the OpenCL Kernel you just wrote.&lt;/p&gt;

&lt;p&gt;&lt;code class="language-plaintext highlighter-rouge"&gt;Kernel Name&lt;/code&gt; is the name of the function you defined with &lt;code class="language-plaintext highlighter-rouge"&gt;__kernel&lt;/code&gt; earlier.&lt;/p&gt;

&lt;p&gt;&lt;code class="language-plaintext highlighter-rouge"&gt;Compiler Options&lt;/code&gt; is a string of any special compiler options like macros/debug options you wish to use for the kernel.&lt;/p&gt;

&lt;p&gt;&lt;code class="language-plaintext highlighter-rouge"&gt;Device Id &amp;amp; Context&lt;/code&gt; are the objects initialized in the first step.&lt;/p&gt;

&lt;p&gt;&lt;code class="language-plaintext highlighter-rouge"&gt;Number of Kernel Arguments&lt;/code&gt; is the total number of arguments the kernel takes (three in this example).&lt;/p&gt;

&lt;p&gt;&lt;code class="language-plaintext highlighter-rouge"&gt;Number of SVM Arguments&lt;/code&gt; is the number of arguments that use SVM (all the gal_data_t’s).&lt;/p&gt;

&lt;p&gt;&lt;code class="language-plaintext highlighter-rouge"&gt;Kernel Arguments&lt;/code&gt; is an array of void pointers to the kernel arguments.&lt;/p&gt;

&lt;p&gt;&lt;code class="language-plaintext highlighter-rouge"&gt;Number of Extra SVM Arguments&lt;/code&gt; is the number of arguments that are implicitly referenced within a struct. For example, &lt;code class="language-plaintext highlighter-rouge"&gt;input_image1_gpu&lt;/code&gt; is directly referenced as a kernel argument, but &lt;code class="language-plaintext highlighter-rouge"&gt;input_image1_gpu-&amp;gt;array&lt;/code&gt; is implicitly referenced.&lt;/p&gt;

&lt;p&gt;&lt;code class="language-plaintext highlighter-rouge"&gt;Extra SVM Arguments&lt;/code&gt; is an array of void pointers to the aforementioned special arguments.&lt;/p&gt;

&lt;p&gt;&lt;code class="language-plaintext highlighter-rouge"&gt;Work Dim&lt;/code&gt; is the number of dimensions of the thread grid (1, 2 or 3).
For example, an array has 1 dimension, indexed by x (0, 1, 2, … 34, 35, 36);
an image has 2 dimensions, indexed by x:y (0:0, 0:1, 1:0, 1:1, …);
a volume has 3 dimensions.&lt;/p&gt;

&lt;p&gt;&lt;code class="language-plaintext highlighter-rouge"&gt;Global Work Size&lt;/code&gt; is the total number of threads spun off.&lt;/p&gt;

&lt;p&gt;&lt;code class="language-plaintext highlighter-rouge"&gt;Local Work Size&lt;/code&gt; is the number of threads in a block (a work-group, in OpenCL terms) on one GPU core. Leaving it as &lt;code class="language-plaintext highlighter-rouge"&gt;NULL&lt;/code&gt; lets the device choose this number.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-C"&gt;
clprm *sprm = (clprm *)malloc (sizeof (clprm));

void *kernel_args[] = { (void *)input_image1_gpu, (void *)input_image2_gpu,
(void *)output_image_gpu };

void *svm_ptrs[]
= { (void *)input_image1_gpu-&amp;gt;array, (void *)input_image2_gpu-&amp;gt;array,
(void *)output_image_gpu-&amp;gt;array };

size_t numactions = input_image1-&amp;gt;size;

sprm-&amp;gt;kernel_path = "./lib/kernels/add.cl";
sprm-&amp;gt;kernel_name = "add";
sprm-&amp;gt;compiler_opts = "";
sprm-&amp;gt;device_id = device_id;
sprm-&amp;gt;context = context;
sprm-&amp;gt;num_kernel_args = 3;
sprm-&amp;gt;num_svm_args = 3;
sprm-&amp;gt;kernel_args = kernel_args;
sprm-&amp;gt;num_extra_svm_args = 3;
sprm-&amp;gt;extra_svm_args = svm_ptrs;
sprm-&amp;gt;work_dim = 1;
sprm-&amp;gt;global_work_size = &amp;amp;numactions;
sprm-&amp;gt;local_work_size = NULL;

gal_cl_finish_queue (command_queue);
&lt;/code&gt;&lt;/pre&gt;
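
&lt;p&gt;The example leaves the local work size as &lt;code class="language-plaintext highlighter-rouge"&gt;NULL&lt;/code&gt;, so the device picks one. If you set it yourself, keep in mind that OpenCL 1.x requires the global work size to be a multiple of the local work size in every dimension. A minimal sketch of the usual rounding, where &lt;code class="language-plaintext highlighter-rouge"&gt;padded_global_size()&lt;/code&gt; is a hypothetical helper and not a Gnuastro function (whether &lt;code class="language-plaintext highlighter-rouge"&gt;gal_cl_threads_spinoff()&lt;/code&gt; already handles this internally is not covered here):&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-C"&gt;/* Round 'num_items' up to the next multiple of 'local_size'. */
unsigned long
padded_global_size (unsigned long num_items, unsigned long local_size)
{
  return ((num_items + local_size - 1) / local_size) * local_size;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For a 5000 x 5000 image (25,000,000 pixels) and a local size of 256, this gives 25,000,192 threads. The 192 extra threads have no pixel to work on, so a kernel launched this way should return early for ids past the last element.&lt;/p&gt;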
&lt;h6 id="copy-output-back-to-host"&gt;Copy Output back to Host&lt;/h6&gt;
&lt;pre&gt;&lt;code class="language-C"&gt;gal_cl_read_data_to_cpu(context, command_queue, output_image_gpu);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The complete program can be accessed &lt;a href="https://github.com/DeadSpheroid/gnuastro/blob/final/cl-example-add-fits.c"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="post-gsoc"&gt;Post GSoC&lt;/h2&gt;
&lt;p&gt;Now that the wrapper infrastructure is set up and convolution is implemented, what’s left is to test the implementation against real-life scenarios to make sure it lives up to the expectations of Gnuastro users.
We also need to come up with a consistent way to execute the same kernel on both OpenCL and GCC, as mentioned earlier.&lt;/p&gt;

&lt;p&gt;Additionally, now that work on one module is complete, it opens the scope for more modules to be implemented on the GPU (like statistics, interpolation and more)&lt;/p&gt;

&lt;h2 id="acknowledgements"&gt;Acknowledgements&lt;/h2&gt;
&lt;p&gt;GSoC has been an incredible learning experience for me both from a technical view and from a personal view.&lt;/p&gt;

&lt;p&gt;On the technical side, I learned a lot about one of my favourite domains in low-level programming, GPU programming, and my understanding of how to write libraries that are easy to use, performant and, above all else, FOSS improved tremendously. It’s one thing when you learn and write code for your own personal projects, but it’s a completely different experience contributing to something like Gnuastro.&lt;/p&gt;

&lt;p&gt;On the personal side, the weekly meetings with the Gnuastro team were always extremely engaging and I got to learn a lot from the team: Giacomo’s work on Astrometry, Alvaro’s work on Deconvolution, and Ronald too. Their feedback on things like debugging using valgrind/gdb and references to other projects using OpenCL, alongside other topics, has been invaluable.&lt;/p&gt;

&lt;p&gt;Above all else, I’m thankful to my mentor &lt;a href="https://akhlaghi.org/"&gt;Mohammad Akhlaghi&lt;/a&gt;. It’s been amazing getting to interact with someone so experienced, and I learned a lot from him, ranging from Astronomy to Hacking the GNU C Library. He was always patient and understanding of my other responsibilities and allowed me to work at my own pace. I’m grateful to him for the opportunity to be a part of the Gnuastro community.&lt;/p&gt;

&lt;p&gt;Finally, I can’t explain how indebted I am to my mentor &lt;a href="https://www.linkedin.com/in/labib-asari/?originalSubdomain=in"&gt;Labeeb Asari&lt;/a&gt;. His knowledge about GPU Programming has been vital to my work on this project and I’m grateful to him for introducing me to the Gnuastro team. From &lt;a href="https://github.com/ProjectX-VJTI"&gt;Project X&lt;/a&gt;, to GSoC, to college in general, he has been a big help in everything I’ve done and I’m glad to have him as a mentor and friend.&lt;/p&gt;

&lt;p&gt;A huge thank you to the Google Summer of Code Team for undertaking this wonderful initiative and I hope they continue this program in future years.&lt;/p&gt;</description><category>gnuastro</category><guid>http://openastronomy.org/Universe_OA/posts/2024/10/20241006_2230_deadspheroid/</guid><pubDate>Sun, 06 Oct 2024 21:30:00 GMT</pubDate></item><item><title>Towards New Speeds</title><link>http://openastronomy.org/Universe_OA/posts/2024/07/20240728_2230_deadspheroid/</link><dc:creator>DeadSpheroid</dc:creator><description>&lt;p class="intro"&gt;In this post, I hope to illustrate how GPU's actually accelerate parallel operations&lt;/p&gt;

&lt;h2 id="lets-get-started"&gt;Let’s get started&lt;/h2&gt;
&lt;!-- TEASER_END --&gt;
&lt;p&gt;Firstly, how do GPUs really work? Well, at a lower level, the architecture looks something like this…&lt;/p&gt;

&lt;p align="center" width="100%"&gt;
&lt;img alt="GPU Architecture" src="https://deadspheroid.github.io/my-blog/assets/img/gpu-arch.png" style="margin-bottom: 0; margin-top: 24px;"&gt;
&lt;/p&gt;

&lt;p&gt;Focus on the differences between the two: CPUs have a few highly specialised and refined cores.
GPUs, on the other hand, have hundreds of more primitive, yet powerful cores.&lt;/p&gt;

&lt;p&gt;This is why GPUs can’t do I/O and the like: they are meant solely for mathematical operations.&lt;/p&gt;

&lt;h2 id="but-whats-the-point-of-so-many-cores"&gt;But what’s the point of so many cores??&lt;/h2&gt;
&lt;p&gt;This is where SIMD, or Single Instruction Multiple Data, processing comes in handy. See, many operations (image and volume ones notoriously) are extremely taxing for the CPU to perform.
Imagine being part of a team of 8 or 16 people stamping sheets of paper, except you have 10^6 sheets to stamp.
Even if you took 1 ms/sheet, it would still take you insanely long to finish your job. And I mean, you have better things to do, right?&lt;/p&gt;

&lt;p align="center" width="100%"&gt;
&lt;img alt="SIMD processing" src="https://deadspheroid.github.io/my-blog/assets/img/simd.png" style="margin-bottom: 0; margin-top: 24px;"&gt;
&lt;/p&gt;

&lt;p&gt;Well, what if you could hire 10^3 people for cheap and give each of them a sheet of paper to stamp? Wouldn’t that greatly speed things up?
This is the core idea behind SIMD processing: you have an operation that is to be done thousands of times, over and over again, just on different data.&lt;/p&gt;

&lt;p&gt;So you give each GPU core a part of the data and let it do the job. Since the GPU has so many cores, it’s not really a problem.&lt;/p&gt;

&lt;h2 id="so-whats-the-catch"&gt;So what’s the catch?&lt;/h2&gt;
&lt;p&gt;Continuing the stamping sheet analogy, giving the sheets to 10^3 workers is challenging and time-consuming. In other words, data transfer is a problem, since GPU VRAM is separate from CPU RAM.&lt;/p&gt;

&lt;p&gt;Additionally, parallel programming forces you to think in an additional dimension, because your code is being executed 100s of times at the same time. This makes writing efficient kernels difficult, since branching is expensive on the GPU, and you need some way of preventing data races.&lt;/p&gt;

&lt;p&gt;That is to say nothing of the increased power consumption and high cost of hardware.&lt;/p&gt;

&lt;p&gt;Despite all of this, however, GPUs are still heavily favoured, because the speed-up they offer greatly outweighs the rest.&lt;/p&gt;

&lt;h2 id="well-how-do-i-use-my-gpu"&gt;Well, how do I use my GPU?&lt;/h2&gt;
&lt;p&gt;Let’s take an example. Hopefully you’re already familiar with image convolution. If not, the image below explains it well.&lt;/p&gt;

&lt;p align="center" width="100%"&gt;
&lt;img alt="Image convolution" src="https://deadspheroid.github.io/my-blog/assets/img/convol.png" style="margin-bottom: 0; margin-top: 24px;"&gt;
&lt;/p&gt;

&lt;p&gt;For each iteration, we center the kernel over a pixel, multiply the overlapping values, add them up and then (optionally) divide by the sum of the kernel values.
On a CPU, for an m x n image and a k x k kernel, this is an O(mn * k^2) operation, meaning the time taken for convolution grows rapidly as the image and kernel get larger.&lt;/p&gt;
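
&lt;p&gt;To get a feel for that O(mn * k^2) figure, here is a small sketch that counts the multiply-add operations of a direct convolution. The &lt;code class="language-plaintext highlighter-rouge"&gt;convolution_multiply_adds()&lt;/code&gt; helper is a hypothetical name used only for illustration, and it ignores the pixels near the edges, where part of the kernel falls outside the image.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-C"&gt;/* Multiply-add operations for a direct convolution of an m x n image
   with a k x k kernel, ignoring edge effects. */
unsigned long long
convolution_multiply_adds (unsigned long long m, unsigned long long n,
                           unsigned long long k)
{
  return m * n * k * k;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As an illustration, a 5000 x 5000 image with an 11 x 11 kernel (the kernel size is just an assumption for this example) needs 3,025,000,000 multiply-adds, and doubling the kernel width quadruples the work.&lt;/p&gt;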

&lt;p&gt;This can be greatly lessened using a GPU.&lt;/p&gt;

&lt;p&gt;But for that, we need to first identify the stamping task here, the tedious computation which is easy to do, but time consuming.&lt;/p&gt;

&lt;h2 id="image-convolution"&gt;Image Convolution&lt;/h2&gt;
&lt;p&gt;If you thought of the multiply-and-sum operation happening at each pixel (loosely, a “matmul”, though strictly it is an element-wise multiplication followed by a sum), you’d be correct!
This is one of the easiest ways to parallelise convolution. We’re repeatedly performing the same multiply-and-sum, with a different pixel at the center each time.&lt;/p&gt;

&lt;p&gt;Therefore SIMD can be applied here, the instruction being the multiply-and-sum and the data being all the pixels.&lt;/p&gt;

&lt;p&gt;On a lower level, this is represented by the different &lt;code class="language-plaintext highlighter-rouge"&gt;thread_id&lt;/code&gt; given to each thread on the GPU. This id represents a one-to-one mapping of an integer to input data elements.
In the case of images, it is the number of the pixel that is at the center.&lt;/p&gt;
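
&lt;p&gt;As a rough sketch of that mapping for a row-major image, the two helpers below turn a thread id into the (row, column) of its pixel. The names &lt;code class="language-plaintext highlighter-rouge"&gt;pixel_row()&lt;/code&gt; and &lt;code class="language-plaintext highlighter-rouge"&gt;pixel_col()&lt;/code&gt; are hypothetical and written in plain C so they can be tried on the CPU; the convolution kernel in this post does the same division and remainder inline.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-C"&gt;/* Row of the pixel handled by thread 'id' in a row-major image. */
int
pixel_row (int id, int image_width)
{
  return id / image_width;
}

/* Column of the pixel handled by thread 'id' in a row-major image. */
int
pixel_col (int id, int image_width)
{
  return id % image_width;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For an image that is 5000 pixels wide, thread 12345 handles the pixel at row 2, column 2345.&lt;/p&gt;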

&lt;p&gt;Here’s an example of how convolution can be parallelized. Note that this is not the most efficient version and is nowhere near perfect, but it is simple enough to understand.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-C"&gt;__kernel void
convolution(
    __global float *image_array, __global size_t *image_dsize,
    __global float *kernell_array, __global size_t *kernell_dsize,
    __global float *output) {

  /* get the image and kernel size */
  int image_height = image_dsize[0];
  int image_width = image_dsize[1];
  int kernell_height = kernell_dsize[0];
  int kernell_width = kernell_dsize[1];

  /* get the global thread id and map it to a pixel (row, col) */
  int id = get_global_id(0);
  int row = id / image_width;
  int col = id % image_width;

  if (row &amp;lt; image_height &amp;amp;&amp;amp; col &amp;lt; image_width) {

    float sum = 0;
    /* multiply the overlapping values and add them up, skipping
       kernel positions that fall outside the image */
    for (int y = -kernell_height / 2; y &amp;lt;= kernell_height / 2; y++) {
      for (int x = -kernell_width / 2; x &amp;lt;= kernell_width / 2; x++) {
        if (row + y &amp;gt;= 0 &amp;amp;&amp;amp; row + y &amp;lt; image_height &amp;amp;&amp;amp; col + x &amp;gt;= 0 &amp;amp;&amp;amp;
            col + x &amp;lt; image_width) {
          sum += (image_array[(row + y) * image_width + col + x] *
                  kernell_array[(y + kernell_height / 2) * kernell_width + x +
                                kernell_width / 2]);
        }
      }
    }
    output[row * image_width + col] = sum;
  }
}

&lt;/code&gt;&lt;/pre&gt;
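
&lt;p&gt;The bounds check in the kernel above only matters when the host launches more work items than there are pixels. One common reason, assumed here rather than taken from the Gnuastro code, is padding the global work size up to a multiple of the work-group size, which OpenCL 1.x requires whenever a local size is passed explicitly. A minimal sketch of that host-side arithmetic, with the hypothetical helper name &lt;code class="language-plaintext highlighter-rouge"&gt;padded_global_size&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-C"&gt;#include &amp;lt;assert.h&amp;gt;
#include &amp;lt;stddef.h&amp;gt;

/* Smallest multiple of local_size that is &amp;gt;= n_items (hypothetical helper). */
size_t padded_global_size(size_t n_items, size_t local_size)
{
  assert(local_size &amp;gt; 0);
  return ((n_items + local_size - 1) / local_size) * local_size;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For a 1000 pixel image and work groups of 64, this launches 1024 work items, and the last 24 simply fail the bounds check and do nothing.&lt;/p&gt;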
&lt;h2 id="a-new-outlook"&gt;A new outlook&lt;/h2&gt;
&lt;p&gt;Of course, the world would be a lovely place if everything could be parallelised as easily as this. Realistically, parallelizing these operations is hard, but you have tools like local groups, barriers, etc to help you out!&lt;/p&gt;</description><category>gnuastro</category><guid>http://openastronomy.org/Universe_OA/posts/2024/07/20240728_2230_deadspheroid/</guid><pubDate>Sun, 28 Jul 2024 21:30:00 GMT</pubDate></item><item><title>Exploring OpenCL memory management</title><link>http://openastronomy.org/Universe_OA/posts/2024/07/20240713_2230_deadspheroid/</link><dc:creator>DeadSpheroid</dc:creator><description>&lt;p class="intro"&gt;In this post, I hope to give a high level understanding of OpenCL's Memory Mechanisms&lt;/p&gt;

&lt;h2 id="the-basics"&gt;The basics&lt;/h2&gt;
&lt;!-- TEASER_END --&gt;
&lt;p&gt;Firstly, it’s important to have a basic understanding of the hardware involved. Keeping it simple, each OpenCL device represents a different piece of hardware, each with its own RAM.
My own laptop has 16GB of CPU RAM and 6GB of VRAM.&lt;/p&gt;

&lt;p&gt;Now, at the heart of C, we have pointers; without them, well, you can’t really get much done in C. The pointers we conventionally use are pointers to CPU RAM.&lt;/p&gt;

&lt;p&gt;So what would happen if you try to pass a CPU pointer to the GPU?
Well, of course, it won’t work: the address means nothing in the GPU’s own memory, so the kernel will typically crash, much like a segfault.
But we still need to use pointers, we can’t just abandon them. So how do we do this?&lt;/p&gt;

&lt;h2 id="buffers"&gt;Buffers&lt;/h2&gt;

&lt;p align="center" width="100%"&gt;
&lt;img alt="OpenCL Map/Unmap Buffers" src="https://deadspheroid.github.io/my-blog/assets/img/opencl-map.png" style="margin-bottom: 0; margin-top: 24px;"&gt;
&lt;/p&gt;

&lt;p&gt;At the simplest level, we have OpenCL Buffers. These buffers are chunks of memory allocated on the OpenCL device as well as on host memory.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-C"&gt;cl_mem clCreateBuffer(
cl_context context,
cl_mem_flags flags,
size_t size,
void *host_ptr,
cl_int *errcode_ret);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Creating a buffer is easy; it’s figuring out what kind of buffer you need that is important.&lt;/p&gt;

&lt;p&gt;Broadly speaking there are 3 types of buffers based on the &lt;code class="language-plaintext highlighter-rouge"&gt;flags&lt;/code&gt; passed:&lt;/p&gt;

&lt;h4 id="cl_mem_use_host_ptr"&gt;CL_MEM_USE_HOST_PTR&lt;/h4&gt;
&lt;p&gt;This tells OpenCL to use the host pointer provided as the underlying memory on host.&lt;/p&gt;

&lt;h4 id="cl_mem_copy_host_ptr"&gt;CL_MEM_COPY_HOST_PTR&lt;/h4&gt;
&lt;p&gt;This flag tells OpenCL to allocate a new buffer and fill it with a copy of the data that the host pointer points to.&lt;/p&gt;

&lt;h4 id="cl_mem_alloc_host_ptr"&gt;CL_MEM_ALLOC_HOST_PTR&lt;/h4&gt;
&lt;p&gt;This one is similar to CL_MEM_USE_HOST_PTR, but OpenCL also allocates the host-accessible memory itself.&lt;/p&gt;

&lt;p&gt;But which one should you use?
Well, if you want a zero copy buffer, i.e. a buffer created without copying memory (especially memory on the host), then use
CL_MEM_USE_HOST_PTR (if you have the memory already initialised)
or
CL_MEM_ALLOC_HOST_PTR (if you plan to initialise the buffer afterward).
The concept of a zero copy buffer is super helpful when the OpenCL device you are targeting is the host CPU itself.&lt;/p&gt;

&lt;p&gt;I mean, you already have the data in CPU RAM, why would you make another copy in CPU RAM by creating a new buffer?&lt;/p&gt;
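
&lt;p&gt;As a rough sketch, the three flags could be exercised side by side like this. The helper name &lt;code class="language-plaintext highlighter-rouge"&gt;make_three_buffers&lt;/code&gt; is made up for illustration, and the context is assumed to have been created elsewhere; only &lt;code class="language-plaintext highlighter-rouge"&gt;clCreateBuffer&lt;/code&gt; and the flags themselves come from the OpenCL API.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-C"&gt;#define CL_TARGET_OPENCL_VERSION 120
#include &amp;lt;CL/cl.h&amp;gt;
#include &amp;lt;stddef.h&amp;gt;

/* Create one buffer per flag over the same n floats (hypothetical helper).
   Returns CL_SUCCESS, or the first OpenCL error code encountered. */
cl_int make_three_buffers(cl_context ctx, float *host, size_t n, cl_mem out[3])
{
  cl_int err;
  size_t bytes = n * sizeof(float);

  /* Zero copy: OpenCL uses the caller's memory, which must outlive the buffer. */
  out[0] = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                          bytes, host, &amp;amp;err);
  if (err != CL_SUCCESS) return err;

  /* Copy: OpenCL allocates new storage and copies host[] into it once. */
  out[1] = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                          bytes, host, &amp;amp;err);
  if (err != CL_SUCCESS) { clReleaseMemObject(out[0]); return err; }

  /* Alloc: OpenCL allocates host-accessible storage; host_ptr must be NULL. */
  out[2] = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                          bytes, NULL, &amp;amp;err);
  if (err != CL_SUCCESS) {
    clReleaseMemObject(out[0]);
    clReleaseMemObject(out[1]);
    return err;
  }
  return CL_SUCCESS;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that only the first buffer keeps referring to the caller’s array afterwards; the other two own their storage.&lt;/p&gt;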

&lt;h2 id="literacy-for-buffers"&gt;Literacy for Buffers&lt;/h2&gt;
&lt;p align="center" width="100%"&gt;
&lt;img alt="OpenCL Map/Unmap Buffers" src="https://deadspheroid.github.io/my-blog/assets/img/opencl-mem.png" style="margin-bottom: 0; margin-top: 24px;"&gt;
&lt;/p&gt;

&lt;p&gt;The above is enough when the host and the OpenCL device are the same hardware, which is seldom the case with OpenCL.
But when you work with GPUs, you need to get the data into GPU VRAM somehow.&lt;/p&gt;

&lt;p&gt;And this is impossible (maybe) without copying the data over.&lt;/p&gt;

&lt;p&gt;So how do you copy the data over to GPU VRAM?
Well, after allocating a buffer as seen before, OpenCL will try to recreate the host side buffer on the device as well.&lt;/p&gt;

&lt;p&gt;But when you update the host side buffer (like reading in input), you’d want it to be reflected on the device as well.
Similarly, when your device is done processing, you need to get the output from device memory to host memory.&lt;/p&gt;

&lt;p&gt;There are two main ways to do this:&lt;/p&gt;
&lt;h4 id="readwrite-buffer"&gt;Read/Write Buffer&lt;/h4&gt;
&lt;p&gt;You have a buffer on host memory and on device memory that mirror each other.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-C"&gt;cl_int clEnqueueReadBuffer(
cl_command_queue command_queue,
cl_mem buffer,
cl_bool blocking_read,
size_t offset,
size_t size,
void* ptr,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event);
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class="language-C"&gt;cl_int clEnqueueWriteBuffer(
cl_command_queue command_queue,
cl_mem buffer,
cl_bool blocking_write,
size_t offset,
size_t size,
const void* ptr,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;OpenCL provides Read/Write commands to force overwrite of one buffer over the other, and in this way, data transfer is achieved.&lt;/p&gt;

&lt;h4 id="mapunmap-buffer"&gt;Map/Unmap Buffer&lt;/h4&gt;
&lt;p&gt;There is a single buffer on device memory that is presented to the CPU on demand.
So “mapping” a buffer will bring it from device memory into host RAM.
Then any changes made will be saved in host RAM.
Finally, once done with the changes, you may “unmap” the buffer, which writes all changes made back to device memory.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-C"&gt;void* clEnqueueMapBuffer(
cl_command_queue command_queue,
cl_mem buffer,
cl_bool blocking_map,
cl_map_flags map_flags,
size_t offset,
size_t size,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event,
cl_int* errcode_ret);
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class="language-C"&gt;cl_int clEnqueueUnmapMemObject(
cl_command_queue command_queue,
cl_mem memobj,
void* mapped_ptr,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this way, data transfer is achieved.&lt;/p&gt;
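
&lt;p&gt;Putting the two styles next to each other, a minimal sketch might look like the following. The helper names &lt;code class="language-plaintext highlighter-rouge"&gt;roundtrip_read_write&lt;/code&gt; and &lt;code class="language-plaintext highlighter-rouge"&gt;add_one_mapped&lt;/code&gt; are invented for this example, and the queue and buffer are assumed to exist already; blocking calls are used to keep the control flow simple.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-C"&gt;#define CL_TARGET_OPENCL_VERSION 120
#include &amp;lt;CL/cl.h&amp;gt;
#include &amp;lt;stddef.h&amp;gt;

/* Style 1: explicit copies between a host array and a device buffer. */
cl_int roundtrip_read_write(cl_command_queue q, cl_mem buf, float *host, size_t n)
{
  size_t bytes = n * sizeof(float);
  /* host -&amp;gt; device; CL_TRUE blocks until the copy has finished */
  cl_int err = clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, bytes, host,
                                    0, NULL, NULL);
  if (err != CL_SUCCESS) return err;
  /* ... kernels would run here ... */
  /* device -&amp;gt; host */
  return clEnqueueReadBuffer(q, buf, CL_TRUE, 0, bytes, host, 0, NULL, NULL);
}

/* Style 2: map the buffer, edit it through a normal pointer, then unmap. */
cl_int add_one_mapped(cl_command_queue q, cl_mem buf, size_t n)
{
  cl_int err;
  float *p = (float *)clEnqueueMapBuffer(q, buf, CL_TRUE,
                                         CL_MAP_READ | CL_MAP_WRITE,
                                         0, n * sizeof(float),
                                         0, NULL, NULL, &amp;amp;err);
  if (err != CL_SUCCESS) return err;
  for (size_t i = 0; i &amp;lt; n; i++)
    p[i] += 1.0f;                       /* ordinary host code */
  err = clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
  if (err != CL_SUCCESS) return err;
  return clFinish(q);                   /* the unmap is asynchronous */
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With mapping there is no second host array to keep in sync, which is where most of the simplicity comes from.&lt;/p&gt;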

&lt;h2 id="to-map-or-not-to-map"&gt;To map or not to map?&lt;/h2&gt;
&lt;p&gt;To be honest, performance differences are very minute, at least from my tests with the gnuastro library.
However, mapping and unmapping makes a world of difference compared to Read/Write when it comes to simplicity.&lt;/p&gt;

&lt;h2 id="the-problem-with-buffers"&gt;The problem with buffers&lt;/h2&gt;
&lt;p&gt;No matter what you do, when working with buffers, you always end up copying the data.
For example, you load an image into CPU RAM, but actually want to work with it on the GPU.
So, you end up copying the image into GPU RAM. In the end, you process the same data twice, once while loading and once while copying.&lt;/p&gt;

&lt;p&gt;For small images (2000 x 2000) this is barely noticeable.
But gnuastro, and the people using gnuastro, deal with astronomical images of incredibly large sizes (I’ve heard 30GB just for one image).&lt;/p&gt;

&lt;p&gt;So, for data that large, any time you save by using parallelised processing on the GPU is most likely lost, and may even be outweighed, by the data transfer time.
Then, using the GPU is almost pointless, unless you use the same data over and over again.&lt;/p&gt;

&lt;p&gt;“Well, can’t I just load the data on the GPU directly?”
That’s not possible, at least not to my knowledge. This is the tradeoff with GPUs.
On a CPU, you have 4/8/16 highly specialised and capable cores (math, I/O), while on the GPU you have 1000s of very primitive cores (only math, no I/O).
So you always have to load the data into CPU RAM first and then move it to GPU RAM.&lt;/p&gt;

&lt;p&gt;So how can we fix this problem?
Well, one of the options is to use Shared Virtual Memory(OpenCL SVM), which enables the GPU to directly access CPU RAM and play with CPU pointers.&lt;/p&gt;

&lt;p&gt;However, I have yet to test SVM in the context of gnuastro, to see if it’s useful.
Besides, SVM also fixes the problem of structs containing pointers (for another post).
Documentation for OpenCL is already sparse, and to add insult to injury, documentation on OpenCL SVM is even sparser.
But I like the challenge…&lt;/p&gt;</description><category>gnuastro</category><guid>http://openastronomy.org/Universe_OA/posts/2024/07/20240713_2230_deadspheroid/</guid><pubDate>Sat, 13 Jul 2024 21:30:00 GMT</pubDate></item><item><title>Deeper into OpenCL</title><link>http://openastronomy.org/Universe_OA/posts/2024/06/20240622_2045_deadspheroid/</link><dc:creator>DeadSpheroid</dc:creator><description>&lt;p class="intro"&gt;In this post, I hope to give a high level understanding of OpenCL and its workings&lt;/p&gt;

&lt;h2 id="setting-it-up"&gt;Setting it up&lt;/h2&gt;
&lt;!-- TEASER_END --&gt;
&lt;p align="center" width="100%"&gt;
&lt;img alt="OpenCL ICD" src="https://deadspheroid.github.io/my-blog/assets/img/ocl-icd.png" style="margin-bottom: 0; margin-top: 24px;"&gt;
&lt;/p&gt;
&lt;p&gt;OpenCL is relatively easy to get up and running on your system.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;For users:
All you need is the OpenCL runtime for your device!
In the case of Nvidia, this comes with the Nvidia drivers, while for Intel CPUs, it has to be installed manually through a package manager.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For developers:
You will need the OpenCL library to link against, and the OpenCL headers as well, again easily available in your package manager.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="heres-a-new-perspective"&gt;Here’s a new perspective&lt;/h2&gt;
&lt;p align="center" width="100%"&gt;
&lt;img alt="OpenCL Platform Model" src="https://deadspheroid.github.io/my-blog/assets/img/ocl-platform.png" style="margin-bottom: 0; margin-top: 24px;"&gt;
&lt;/p&gt;
&lt;p&gt;OpenCL presents a general interface to the developer, no matter what the device or the architecture.&lt;/p&gt;

&lt;p&gt;Firstly, we have the host, which is responsible for all the book-keeping, and task scheduling on the OpenCL device.&lt;/p&gt;

&lt;p&gt;Then, we have our OpenCL device, which is divided into a number of compute units.
Each Compute Unit (CU) is further divided into a number of processing elements.&lt;/p&gt;

&lt;p&gt;But what do these words actually mean?&lt;/p&gt;

&lt;p&gt;Well, a Processing Element (PE) is a single unit that is responsible for executing a single thread (also called a work item). Think of a single function being executed.&lt;/p&gt;

&lt;p&gt;Each PE has its own private memory, accessible to no one but this PE.&lt;/p&gt;

&lt;p&gt;A bunch of processing elements are grouped together to form a compute unit which, at a time, executes a single work group (a grouping of many work items).&lt;/p&gt;

&lt;p&gt;The CUs all share a global memory, accessible by anyone.&lt;/p&gt;

&lt;p&gt;So for a CPU, the maximum number of CUs is the number of CPU cores!&lt;/p&gt;

&lt;p&gt;But why do you want work groups? Why not have work items only?&lt;/p&gt;

&lt;p&gt;Well, grouping work items like this allows for a greater deal of complexity, because we can synchronize across the items in a work group, have a local memory only for this work group, and more…&lt;/p&gt;

&lt;h2 id="a-complete-walkthrough"&gt;A complete walkthrough&lt;/h2&gt;
&lt;p align="center" width="100%"&gt;
&lt;img alt="OpenCL Execution Model" src="https://deadspheroid.github.io/my-blog/assets/img/ocl-exec.png" style="margin-bottom: 0; margin-top: 24px;"&gt;
&lt;/p&gt;

&lt;p&gt;Let’s look at a typical workflow for an OpenCL program.&lt;/p&gt;

&lt;h3 id="initialisation"&gt;Initialisation&lt;/h3&gt;
&lt;p&gt;First we need to check the currently available OpenCL platforms, which are basically the implementations of OpenCL available on your system.&lt;/p&gt;

&lt;p&gt;For example, you can have both Intel OpenCL and POCL OpenCL for your i7 CPU.&lt;/p&gt;

&lt;p&gt;Then from these platforms, you need to choose a device to execute on. OpenCL supports CPUs, GPUs, FPGAs, and all sorts of accelerators.&lt;/p&gt;

&lt;h3 id="context"&gt;Context&lt;/h3&gt;
&lt;p&gt;Once you have the platform and device you wish to use, you need to create an OpenCL context, which will handle everything for that particular platform and device.&lt;/p&gt;

&lt;h3 id="command-queue"&gt;Command Queue&lt;/h3&gt;
&lt;p&gt;Then, you have to create a command queue, which, as the name suggests, will store any commands (kernels, for example) you queue for execution, and dispatch them in order (or even out of order if you like!).&lt;/p&gt;

&lt;h3 id="kernel"&gt;Kernel&lt;/h3&gt;
&lt;p&gt;After the command queue, you must compile the kernel source code (the API provides functions to do this), so that it can be executed later.&lt;/p&gt;

&lt;h3 id="memory"&gt;Memory&lt;/h3&gt;
&lt;p&gt;Finally, one of the most important parts of this entire process is passing the input to the OpenCL device.&lt;/p&gt;

&lt;p&gt;Now, initially the data is stored on your CPU RAM, which is unfortunately inaccessible to your GPU.&lt;/p&gt;

&lt;p&gt;Therefore you need to copy the data to your GPU RAM, using the &lt;code class="language-plaintext highlighter-rouge"&gt;cl_mem&lt;/code&gt; interface that OpenCL provides.&lt;/p&gt;

&lt;p&gt;However, if you know that the device being used is the same CPU, then this copy can be skipped, to save time, using the &lt;code class="language-plaintext highlighter-rouge"&gt;CL_MEM_USE_HOST_PTR&lt;/code&gt; flag while creating a &lt;code class="language-plaintext highlighter-rouge"&gt;cl_mem&lt;/code&gt; object.&lt;/p&gt;

&lt;h3 id="execution"&gt;Execution&lt;/h3&gt;
&lt;p&gt;At the end, you can use the command queue created earlier, along with the &lt;code class="language-plaintext highlighter-rouge"&gt;cl_mem&lt;/code&gt; created previously, to execute the compiled kernel on the device.&lt;/p&gt;

&lt;p&gt;Afterwards, don’t forget to copy the output data back to CPU RAM, if the execution was done on a GPU.&lt;/p&gt;
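
&lt;p&gt;The steps above can be sketched end to end in one small function. This is a minimal illustration rather than production code: the function name &lt;code class="language-plaintext highlighter-rouge"&gt;double_on_device&lt;/code&gt; and the toy kernel &lt;code class="language-plaintext highlighter-rouge"&gt;twice&lt;/code&gt; are made up for the example, it simply takes the first platform and device it finds, and it assumes the OpenCL 1.2 API.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-C"&gt;#define CL_TARGET_OPENCL_VERSION 120
#include &amp;lt;CL/cl.h&amp;gt;
#include &amp;lt;stddef.h&amp;gt;

/* Kernel source, compiled at run time: doubles every element in place. */
static const char *SRC =
  "__kernel void twice(__global float *a) {"
  "  size_t i = get_global_id(0);"
  "  a[i] = 2.0f * a[i];"
  "}";

/* Doubles data[0..n-1] on the first OpenCL device found (hypothetical helper).
   Returns 0 on success, -1 if there is no device or any call fails. */
int double_on_device(float *data, size_t n)
{
  cl_platform_id plat; cl_device_id dev; cl_int err; int rc = -1;
  cl_context ctx = NULL; cl_command_queue q = NULL;
  cl_program prog = NULL; cl_kernel k = NULL; cl_mem buf = NULL;
  size_t bytes = n * sizeof(float);

  /* Initialisation: pick the first platform and its first device. */
  if (clGetPlatformIDs(1, &amp;amp;plat, NULL) != CL_SUCCESS) return -1;
  if (clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &amp;amp;dev, NULL) != CL_SUCCESS)
    return -1;

  /* Context, then an in-order command queue. */
  ctx = clCreateContext(NULL, 1, &amp;amp;dev, NULL, NULL, &amp;amp;err);
  if (err != CL_SUCCESS) goto done;
  q = clCreateCommandQueue(ctx, dev, 0, &amp;amp;err);
  if (err != CL_SUCCESS) goto done;

  /* Kernel: compile the source and look up the entry point. */
  prog = clCreateProgramWithSource(ctx, 1, &amp;amp;SRC, NULL, &amp;amp;err);
  if (err != CL_SUCCESS) goto done;
  if (clBuildProgram(prog, 1, &amp;amp;dev, NULL, NULL, NULL) != CL_SUCCESS) goto done;
  k = clCreateKernel(prog, "twice", &amp;amp;err);
  if (err != CL_SUCCESS) goto done;

  /* Memory: a device buffer initialised with a copy of the input. */
  buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                       bytes, data, &amp;amp;err);
  if (err != CL_SUCCESS) goto done;

  /* Execution: one work item per element, then copy the result back. */
  if (clSetKernelArg(k, 0, sizeof(cl_mem), &amp;amp;buf) != CL_SUCCESS) goto done;
  if (clEnqueueNDRangeKernel(q, k, 1, NULL, &amp;amp;n, NULL, 0, NULL, NULL)
      != CL_SUCCESS) goto done;
  if (clEnqueueReadBuffer(q, buf, CL_TRUE, 0, bytes, data, 0, NULL, NULL)
      != CL_SUCCESS) goto done;
  rc = 0;

done:
  if (buf)  clReleaseMemObject(buf);
  if (k)    clReleaseKernel(k);
  if (prog) clReleaseProgram(prog);
  if (q)    clReleaseCommandQueue(q);
  if (ctx)  clReleaseContext(ctx);
  return rc;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A real program would usually let the user choose the platform and device, and would print the build log when compilation fails.&lt;/p&gt;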

&lt;p&gt;However, there’s still a ton of unexplained stuff like, “How do you save the time wasted in copying data to the device?” or “Can you pass any data to the device? Even structs?”&lt;/p&gt;

&lt;p&gt;We’ll explore OpenCL more in subsequent posts.&lt;/p&gt;</description><category>gnuastro</category><guid>http://openastronomy.org/Universe_OA/posts/2024/06/20240622_2045_deadspheroid/</guid><pubDate>Sat, 22 Jun 2024 19:45:00 GMT</pubDate></item><item><title>OpenCL, meet the Gnuastro Build System</title><link>http://openastronomy.org/Universe_OA/posts/2024/06/20240609_0045_deadspheroid/</link><dc:creator>DeadSpheroid</dc:creator><description>&lt;p class="intro"&gt;In this post, I hope to summarize the work done so far towards my GSoC project for integrating OpenCL with the Gnuastro library and my relatively limited understanding of OpenCL.&lt;/p&gt;

&lt;h2 id="what-is-opencl"&gt;What is OpenCL?&lt;/h2&gt;
&lt;!-- TEASER_END --&gt;
&lt;p align="center" width="100%"&gt;
&lt;img alt="OpenCL Logo" src="https://deadspheroid.github.io/my-blog/assets/img/opencl-logo.png" style="margin-bottom: 0; margin-top: 24px;"&gt;
&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.khronos.org/opencl/"&gt;Open Computing Language&lt;/a&gt;&lt;/strong&gt; is a framework for writing programs that execute across &lt;strong&gt;heterogeneous&lt;/strong&gt; platforms. In simpler terms, OpenCL provides a standard interface for programmers to execute the &lt;strong&gt;same&lt;/strong&gt; code across &lt;strong&gt;multiple&lt;/strong&gt; devices, be it a CPU or a GPU or &lt;strong&gt;any&lt;/strong&gt; other accelerator.&lt;/p&gt;

&lt;p&gt;It consists of the OpenCL standard, which is maintained by &lt;a href="https://www.khronos.org/opencl/"&gt;Khronos&lt;/a&gt; and implemented by the various hardware &lt;strong&gt;manufacturers&lt;/strong&gt; and by the &lt;strong&gt;open source community&lt;/strong&gt; across a wide variety of devices.&lt;/p&gt;

&lt;p&gt;Most modern devices support OpenCL in some form or other. &lt;strong&gt;Intel/Nvidia&lt;/strong&gt;, for example, provide their own &lt;strong&gt;proprietary&lt;/strong&gt; implementations. On the other hand, &lt;strong&gt;POCL&lt;/strong&gt;, an &lt;strong&gt;open source&lt;/strong&gt; project, provides implementations for those that don’t have actively maintained proprietary ones, like &lt;strong&gt;AMD&lt;/strong&gt;.&lt;/p&gt;

&lt;h2 id="why-opencl"&gt;Why OpenCL?&lt;/h2&gt;

&lt;p align="center" width="100%"&gt;
&lt;img alt="OpenCL versus CUDA" src="https://deadspheroid.github.io/my-blog/assets/img/cl-cuda.jpeg" style="margin-bottom: 0; margin-top: 24px;"&gt;
&lt;/p&gt;

&lt;p&gt;Unlike certain &lt;strong&gt;proprietary&lt;/strong&gt; frameworks &lt;em&gt;cough&lt;/em&gt; &lt;a href="https://developer.nvidia.com/about-cuda"&gt;CUDA&lt;/a&gt; &lt;em&gt;cough&lt;/em&gt;, OpenCL is not constrained to any particular &lt;strong&gt;manufacturer&lt;/strong&gt;. You can target &lt;strong&gt;any GPU/CPU&lt;/strong&gt; as long as you get the OpenCL implementation for that device. This is made easy thanks to projects like &lt;a href="https://portablecl.org/"&gt;POCL&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The performance of &lt;strong&gt;CUDA versus OpenCL&lt;/strong&gt; is heavily debated and leans towards CUDA for Nvidia hardware, but the difference depends on the use case and isn’t too much of a concern as compared to the way they are used.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;downside&lt;/strong&gt; of OpenCL is the &lt;strong&gt;smaller&lt;/strong&gt; community and the lack of many &lt;strong&gt;modern features&lt;/strong&gt; that CUDA brings.&lt;/p&gt;

&lt;blockquote&gt;
The inner workings of OpenCL, how I managed to set it up, how the OpenCL C API works is another long story and is deserving of its own post.
&lt;/blockquote&gt;

&lt;h2 id="kickoff-with-gnuastro"&gt;Kickoff with Gnuastro&lt;/h2&gt;

&lt;p&gt;The first goal for the project was to figure out a way to &lt;strong&gt;integrate&lt;/strong&gt; OpenCL with the Gnuastro build system.&lt;/p&gt;

&lt;p&gt;Gnuastro, like much other free software, uses the &lt;strong&gt;GNU Build System&lt;/strong&gt;, also called &lt;a href="https://www.gnu.org/software/automake/faq/autotools-faq.html"&gt;GNU Autotools&lt;/a&gt;.&lt;/p&gt;

&lt;p align="center" width="100%"&gt;
&lt;img alt="GNU Autotools" src="https://deadspheroid.github.io/my-blog/assets/img/gnu-logo.png" style="margin-bottom: 0; margin-top: 24px;"&gt;
&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;three&lt;/strong&gt; major components of Autotools are:&lt;/p&gt;

&lt;h4 id="autoconf"&gt;Autoconf&lt;/h4&gt;
&lt;p&gt;At the heart of Autotools, we have &lt;a href="https://www.gnu.org/software/autoconf/"&gt;Autoconf&lt;/a&gt;, which generates a &lt;strong&gt;single&lt;/strong&gt; &lt;code class="language-plaintext highlighter-rouge"&gt;configure&lt;/code&gt; &lt;strong&gt;script&lt;/strong&gt; from a &lt;code class="language-plaintext highlighter-rouge"&gt;configure.ac&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;This &lt;code class="language-plaintext highlighter-rouge"&gt;configure&lt;/code&gt; script scans the &lt;strong&gt;environment&lt;/strong&gt; for various files and &lt;strong&gt;libraries&lt;/strong&gt;, specific versions of them, the &lt;strong&gt;hardware&lt;/strong&gt; being used, and more. Then, it &lt;strong&gt;configures&lt;/strong&gt; the build of the project, enabling or disabling certain parts depending on what was found and what wasn’t.&lt;/p&gt;

&lt;p&gt;In this way, the &lt;strong&gt;portability&lt;/strong&gt; of any project can be ensured by simply distributing the &lt;strong&gt;configure&lt;/strong&gt; script, along with the &lt;code class="language-plaintext highlighter-rouge"&gt;Makefile.in&lt;/code&gt;s.&lt;/p&gt;

&lt;h4 id="automake"&gt;Automake&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://www.gnu.org/software/automake/"&gt;Automake&lt;/a&gt; makes use of information found by &lt;code class="language-plaintext highlighter-rouge"&gt;configure&lt;/code&gt; and &lt;strong&gt;generates&lt;/strong&gt; the &lt;code class="language-plaintext highlighter-rouge"&gt;Makefile&lt;/code&gt;s necessary to &lt;strong&gt;build&lt;/strong&gt; the project.&lt;/p&gt;

&lt;p&gt;To be more precise, it &lt;strong&gt;parses&lt;/strong&gt; &lt;code class="language-plaintext highlighter-rouge"&gt;Makefile.am&lt;/code&gt;s into &lt;code class="language-plaintext highlighter-rouge"&gt;Makefile.in&lt;/code&gt;s which are in turn &lt;strong&gt;parsed&lt;/strong&gt; by &lt;code class="language-plaintext highlighter-rouge"&gt;configure&lt;/code&gt; to produce the final &lt;code class="language-plaintext highlighter-rouge"&gt;Makefile&lt;/code&gt;s. Automake also performs &lt;strong&gt;automatic dependency tracking&lt;/strong&gt;, so that recompiling isn’t done unless &lt;strong&gt;required&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;All you have to do is specify the &lt;strong&gt;name&lt;/strong&gt; and each of the &lt;strong&gt;sources&lt;/strong&gt; involved in the library/binary, and Automake does the rest.&lt;/p&gt;

&lt;h4 id="libtool"&gt;Libtool&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://www.gnu.org/software/libtool/"&gt;Libtool&lt;/a&gt; is responsible for abstracting the &lt;strong&gt;library&lt;/strong&gt; creation process, since different platforms handle static/dynamic libraries &lt;strong&gt;differently&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
I mainly worked with Automake and Autoconf during integration and didn't really touch Libtool.
&lt;/blockquote&gt;

&lt;h2 id="stepping-into-integration"&gt;Stepping into Integration&lt;/h2&gt;

&lt;h4 id="inside-configureac"&gt;Inside &lt;code class="language-plaintext highlighter-rouge"&gt;configure.ac&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;Without getting into detail, when checking for the &lt;strong&gt;presence&lt;/strong&gt; of OpenCL, it suffices to check for &lt;code class="language-plaintext highlighter-rouge"&gt;libOpenCL.so&lt;/code&gt; and the &lt;code class="language-plaintext highlighter-rouge"&gt;CL/cl.h&lt;/code&gt; header file.&lt;/p&gt;

&lt;p&gt;That is, Gnuastro should be able to &lt;strong&gt;include&lt;/strong&gt; the OpenCL header file to use its C API, and then later &lt;strong&gt;link&lt;/strong&gt; against the OpenCL library.&lt;/p&gt;

&lt;p&gt;Luckily for us, &lt;a href="https://www.gnu.org/software/gnulib/"&gt;Gnulib&lt;/a&gt; provides a simple &lt;code class="language-plaintext highlighter-rouge"&gt;AC_LIB_HAVE_LINKFLAGS&lt;/code&gt; &lt;a href="https://www.gnu.org/software/gnulib/manual/html_node/Searching-for-Libraries.html"&gt;macro&lt;/a&gt; which takes as input, a library &lt;strong&gt;name&lt;/strong&gt; and a &lt;strong&gt;test code&lt;/strong&gt; and tries to find the &lt;strong&gt;library&lt;/strong&gt; and &lt;strong&gt;compile/link&lt;/strong&gt; the test code.&lt;/p&gt;

&lt;p&gt;Upon successfully executing, it &lt;strong&gt;sets certain variables&lt;/strong&gt;, so we can modify further building on the basis of &lt;strong&gt;finding OpenCL&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="language-bash highlighter-rouge"&gt;&lt;div class="highlight"&gt;&lt;pre class="highlight"&gt;&lt;code&gt;AC_LIB_HAVE_LINKFLAGS&lt;span class="o"&gt;([&lt;/span&gt;OpenCL], &lt;span class="o"&gt;[]&lt;/span&gt;, &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="c"&gt;#include &amp;lt;CL/cl.h&amp;gt;])&lt;/span&gt;
AS_IF&lt;span class="o"&gt;([&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="s2"&gt;"x&lt;/span&gt;&lt;span class="nv"&gt;$LIBOPENCL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; x],
&lt;span class="o"&gt;[&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;unsuccessful ...
&lt;span class="o"&gt;]&lt;/span&gt;,
&lt;span class="o"&gt;[&lt;/span&gt;
&lt;span class="nv"&gt;LIBS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LIBOPENCL&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$LIBS&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;has_ocl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;successful ...
&lt;span class="o"&gt;])&lt;/span&gt;
AM_CONDITIONAL&lt;span class="o"&gt;([&lt;/span&gt;COND_HASOPENCL], &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="s2"&gt;"x&lt;/span&gt;&lt;span class="nv"&gt;$has_ocl&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"x1"&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;After making these modifications to &lt;code class="language-plaintext highlighter-rouge"&gt;configure.ac&lt;/code&gt;, we can now &lt;strong&gt;test&lt;/strong&gt; whether OpenCL was found inside the various &lt;code class="language-plaintext highlighter-rouge"&gt;Makefile.am&lt;/code&gt;s and accordingly change the &lt;strong&gt;build&lt;/strong&gt;.&lt;/p&gt;

&lt;h4 id="inside-makefileam"&gt;Inside &lt;code class="language-plaintext highlighter-rouge"&gt;Makefile.am&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;Now, we can use the &lt;strong&gt;variable&lt;/strong&gt; we set previously in &lt;code class="language-plaintext highlighter-rouge"&gt;configure.ac&lt;/code&gt; and either include or exclude the OpenCL modules from being compiled and included in the &lt;strong&gt;library&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="language-bash highlighter-rouge"&gt;&lt;div class="highlight"&gt;&lt;pre class="highlight"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;COND_HASOPENCL
&lt;span class="si"&gt;$(&lt;/span&gt;info &lt;span class="s2"&gt;"Found OpenCL"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
MAYBE_CL &lt;span class="o"&gt;=&lt;/span&gt; cl_utils.c
MAYBE_CL_H &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;headersdir&lt;span class="si"&gt;)&lt;/span&gt;/cl_utils.h
MAYBE_CONVOLVE_CL &lt;span class="o"&gt;=&lt;/span&gt; cl_convolve.c
&lt;span class="k"&gt;else&lt;/span&gt;
&lt;span class="si"&gt;$(&lt;/span&gt;info &lt;span class="s2"&gt;"What is Opencl?"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
endif
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div class="language-bash highlighter-rouge"&gt;&lt;div class="highlight"&gt;&lt;pre class="highlight"&gt;&lt;code&gt;libgnuastro_la_SOURCES &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="si"&gt;$(&lt;/span&gt;MAYBE_NUMPY_C&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="si"&gt;$(&lt;/span&gt;MAYBE_WCSDISTORTION&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="si"&gt;$(&lt;/span&gt;MAYBE_CL&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="si"&gt;$(&lt;/span&gt;MAYBE_CONVOLVE_CL&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
arithmetic.c &lt;span class="se"&gt;\&lt;/span&gt;
arithmetic-and.c &lt;span class="se"&gt;\&lt;/span&gt;
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Additionally, we need to &lt;strong&gt;save&lt;/strong&gt; this variable in Gnuastro’s &lt;code class="language-plaintext highlighter-rouge"&gt;config.h&lt;/code&gt; file for later use, to &lt;strong&gt;prevent&lt;/strong&gt; other modules from mistakenly including the OpenCL ones in case OpenCL was &lt;strong&gt;not compiled&lt;/strong&gt;.&lt;/p&gt;
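
&lt;p&gt;On the C side, such a saved value is typically consumed through a preprocessor guard. The sketch below shows the general pattern only: the macro name &lt;code class="language-plaintext highlighter-rouge"&gt;SKETCH_HAVE_OPENCL&lt;/code&gt; and the function are hypothetical and are not the names Gnuastro uses.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-C"&gt;/* Hypothetical macro name; a generated config.h would set it to 1 when
   configure found OpenCL. Default to 0 so this file builds without it. */
#ifndef SKETCH_HAVE_OPENCL
#define SKETCH_HAVE_OPENCL 0
#endif

#if SKETCH_HAVE_OPENCL
#include &amp;lt;CL/cl.h&amp;gt;   /* only pulled in when OpenCL is available */
#endif

/* Report which convolution backend this build would use. */
const char *sketch_convolve_backend(void)
{
#if SKETCH_HAVE_OPENCL
  return "opencl";
#else
  return "cpu";
#endif
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because the OpenCL header is only included inside the guard, a build without OpenCL never sees it.&lt;/p&gt;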

&lt;h2 id="checking-for-build-system-yes"&gt;checking for build system… yes&lt;/h2&gt;
&lt;p&gt;Now when someone builds Gnuastro, if OpenCL is &lt;strong&gt;present&lt;/strong&gt; on their system, then the OpenCL relevant files are &lt;strong&gt;compiled and included in the library&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;On the other hand, if OpenCL is &lt;strong&gt;absent&lt;/strong&gt;, then the library is &lt;strong&gt;built as normal&lt;/strong&gt;, as if OpenCL never existed.&lt;/p&gt;

&lt;p&gt;Finally, we can get started with the &lt;strong&gt;actual&lt;/strong&gt; OpenCl part and we’ll have a look at Image Convolution(astconvolve) in the next post…&lt;/p&gt;</description><category>gnuastro</category><guid>http://openastronomy.org/Universe_OA/posts/2024/06/20240609_0045_deadspheroid/</guid><pubDate>Sat, 08 Jun 2024 23:45:00 GMT</pubDate></item><item><title>OpenCL, meet the Gnuastro Build System</title><link>http://openastronomy.org/Universe_OA/posts/2024/06/20240609_0000_deadspheroid/</link><dc:creator>DeadSpheroid</dc:creator><description>&lt;p class="intro"&gt;&lt;span class="dropcap"&gt;I&lt;/span&gt;n this post, I hope to summarize the work done so far towards my GSoC project in integrating OpenCL with the Gnuastro library and my relatively limited understanding of OpenCL.&lt;/p&gt;
&lt;!-- TEASER_END --&gt;</description><category>gnuastro</category><guid>http://openastronomy.org/Universe_OA/posts/2024/06/20240609_0000_deadspheroid/</guid><pubDate>Sat, 08 Jun 2024 23:00:00 GMT</pubDate></item><item><title>Final GSoC Report</title><link>http://openastronomy.org/Universe_OA/posts/2023/08/20230822_0000_labeeb-7z/</link><dc:creator>Labib Asari</dc:creator><description>&lt;p&gt;I will be discussing the goals of my GSoC project, how I spent my time and what I learned during this period. I will also be discussing the future of my project and what I plan to do next.&lt;/p&gt;

&lt;h2 id="goals-of-my-gsoc"&gt;Goals of my GSoC&lt;/h2&gt;
&lt;!-- TEASER_END --&gt;

&lt;p&gt;The &lt;a href="https://openastronomy.org/gsoc/gsoc2023/#/projects?project=gnuastro_library_in_python"&gt;original&lt;/a&gt; Google Summer of Code project this year was to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Redesign the &lt;code class="language-plaintext highlighter-rouge"&gt;error handling&lt;/code&gt; inside the Gnuastro C library.&lt;/li&gt;
&lt;li&gt;Add wrappers for Gnuastro library functions in &lt;code class="language-plaintext highlighter-rouge"&gt;pyGnuastro&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Prior to GSoC, my experience mostly consisted of Deep Learning and Computer Vision. I had a good high-level understanding of how GPUs were leveraged for the compute intensive tasks in various libraries and frameworks in these domains. I had started exploring the lower-level abstractions over GPUs using the CUDA framework.&lt;/p&gt;

&lt;p&gt;In the early weeks of February, I delivered a &lt;a href="https://docs.google.com/presentation/d/1texW2MQJqjdbtPCuLULqXf8-1GuIrub4bJh_b_EffS4/edit?usp=sharing"&gt;presentation&lt;/a&gt; to the Gnuastro development team. The presentation was a proposal outlining the integration of GPU support into Gnuastro, an idea borrowed from the Machine Learning world but with huge potential in the field of Astronomy. Both of these domains process huge amounts of data.&lt;/p&gt;

&lt;p&gt;My mentor &lt;a href="https://akhlaghi.org/"&gt;Mohammad Akhlaghi&lt;/a&gt; was very supportive of this idea and gave me the go ahead to start working on it.&lt;/p&gt;

&lt;p&gt;And so, we had a 3rd goal for this GSoC project :&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Adding GPU support to Gnuastro.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id="work-done-throughout-the-gsoc"&gt;Work Done Throughout the GSoC&lt;/h3&gt;

&lt;h4 id="error-handling"&gt;Error Handling&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Need&lt;/strong&gt; : All of Gnuastro’s library functions performed error handling using &lt;code class="language-plaintext highlighter-rouge"&gt;error(EXIT_FAILURE, ....)&lt;/code&gt;, which prints a detailed error message and exits the program whenever an error is encountered. This wasn’t a problem for the Gnuastro programs. However, for other callers like pyGnuastro it is problematic, as it exits the entire Python environment.&lt;/p&gt;

&lt;p&gt;The new error handling mechanism defines a new module &lt;code class="language-plaintext highlighter-rouge"&gt;error.h&lt;/code&gt; with a new data structure &lt;code class="language-plaintext highlighter-rouge"&gt;gal_error_t&lt;/code&gt;. The exact contents of this structure have gone through multiple iterations, but the final one is :&lt;/p&gt;

&lt;p&gt;&lt;img alt="gal_error_t" src="https://labeeb-7z.github.io/Blogs/img/posts/final/gal_error.png"&gt;&lt;/p&gt;

&lt;p&gt;The user should define a &lt;code class="language-plaintext highlighter-rouge"&gt;gal_error_t&lt;/code&gt; before the function call and pass it as an argument to the function (every function in Gnuastro will have an extra argument now).&lt;/p&gt;

&lt;p&gt;During the function execution, if any error occurs, it will populate the &lt;code class="language-plaintext highlighter-rouge"&gt;gal_error_t&lt;/code&gt; with the error message and the error code. The user can then check the error code and the error message to determine what went wrong.&lt;/p&gt;

&lt;p&gt;&lt;img alt="new_error_handling" src="https://labeeb-7z.github.io/Blogs/img/posts/final/error_handling.png"&gt;&lt;/p&gt;
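
&lt;p&gt;To make the idea concrete, here is a minimal Python sketch of the pattern (it is not the actual Gnuastro code). The names &lt;code class="language-plaintext highlighter-rouge"&gt;ErrorRecord&lt;/code&gt;, &lt;code class="language-plaintext highlighter-rouge"&gt;safe_divide&lt;/code&gt;, &lt;code class="language-plaintext highlighter-rouge"&gt;divide_or_raise&lt;/code&gt; and &lt;code class="language-plaintext highlighter-rouge"&gt;ERR_DIVIDE_BY_ZERO&lt;/code&gt; are hypothetical and only stand in for &lt;code class="language-plaintext highlighter-rouge"&gt;gal_error_t&lt;/code&gt; and a library function. Instead of exiting, the function records the error in the structure given by the caller, and the caller decides what to do.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical error code, only for this illustration.
ERR_DIVIDE_BY_ZERO = 1


@dataclass
class ErrorRecord:
    """Stand-in for an error structure that is filled by the callee."""
    code: int = 0          # 0 means "no error"
    message: str = ""

    def occurred(self):
        return self.code != 0


def safe_divide(a, b, err):
    """Library-style function: reports failure through `err`, never exits."""
    if b == 0:
        err.code = ERR_DIVIDE_BY_ZERO
        err.message = "safe_divide: denominator is zero"
        return None
    return a / b


def divide_or_raise(a, b):
    """What a Python wrapper could do: turn the record into an exception."""
    err = ErrorRecord()
    result = safe_divide(a, b, err)
    if err.occurred():
        raise ValueError(err.message)
    return result


err = ErrorRecord()
print(safe_divide(1.0, 0.0, err), err.code, err.message)
```

&lt;p&gt;With this shape, a wrapper like the hypothetical &lt;code class="language-plaintext highlighter-rouge"&gt;divide_or_raise&lt;/code&gt; can raise a normal Python exception, so the interpreter keeps running after a failure.&lt;/p&gt;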

&lt;p&gt;Corresponding functions are added in &lt;code class="language-plaintext highlighter-rouge"&gt;error.h&lt;/code&gt; for writing and managing the structure. Some functions are also provided for the Python interface.&lt;/p&gt;

&lt;p&gt;After the module was finished, Mohammad implemented the new error mechanism inside the &lt;code class="language-plaintext highlighter-rouge"&gt;cosmology.c&lt;/code&gt; module, and then I used it to update the corresponding cosmology module in pyGnuastro. This solved the main problem of the Python environment exiting on any error. Instead, errors are now reported inside the Python shell.&lt;/p&gt;

&lt;p&gt;&lt;img alt="python_error" src="https://labeeb-7z.github.io/Blogs/img/posts/final/py-error.png"&gt;&lt;/p&gt;

&lt;p&gt;This completed the setup of the low-level infrastructure for the new error handling mechanism. It can now be used by other modules of Gnuastro to update what happens when an error occurs. Implementing the high-level error function calls, deciding the exact error type and defining what message should be shown would be best done by the original authors of the modules.&lt;/p&gt;

&lt;p&gt;The new error handling mechanism currently lives at the &lt;a href="https://gitlab.com/makhlaghi/gnuastro-dev/-/tree/error"&gt;Gnuastro repository&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id="pygnuastro"&gt;pyGnuastro&lt;/h4&gt;

&lt;p&gt;Apart from implementing the new error handling mechanism in existing modules of pyGnuastro, I worked on 2 major things :&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Implemented speclines module in pyGnuastro&lt;/strong&gt; : this is a simple module without any complex data structures. I tried this first when I was learning about the C-Python API. It gave me a good grasp of what is going on in the existing pyGnuastro implementation, and how.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GAL_DATA_T for Python&lt;/strong&gt; : The core data structure of Gnuastro, gal_data_t, is a C struct. Any external data is represented using this structure, so it was crucial to have a similar structure in Python. Previously, Jash had worked on loading and saving FITS files, and made use of the Numpy-C API to convert the raw data inside the gal_data_t to a Numpy array. This was an extremely clever and efficient idea, however it skipped all the other details inside gal_data_t. We had to find a way to represent the entire gal_data_t in Python. The normal way to create a new data structure in Python would be to create a new class. However, the wrappers are written in the C language, where we don’t get access to the Python interpreter. I took some more inspiration from Numpy and how they &lt;a href="https://numpy.org/doc/stable/reference/c-api/index.html"&gt;created a new Python type&lt;/a&gt;, their core data structure &lt;code class="language-plaintext highlighter-rouge"&gt;numpy.ndarray&lt;/code&gt;, using the C-Python API. I then discovered that the API allows us to &lt;a href="https://docs.python.org/3/extending/newtypes_tutorial.html"&gt;define custom objects&lt;/a&gt; which may be used as a data type by the Python interpreter. I learnt and used them to create a corresponding &lt;code class="language-plaintext highlighter-rouge"&gt;pygnuastro.data&lt;/code&gt; for pyGnuastro. It basically acts as a new data type in Python, similar to &lt;code class="language-plaintext highlighter-rouge"&gt;numpy.ndarray&lt;/code&gt;, and holds the other details of gal_data_t. After this we had the details of gal_data_t in Python, but we were missing out on Jash’s idea of utilizing Numpy in pyGnuastro. I spent some time making sure we can still utilize Numpy’s speed inside pyGnuastro. The C-Python API is versatile and allows having complex objects as sub-objects of other objects.
Eventually we had the array (raw data) being represented as a &lt;code class="language-plaintext highlighter-rouge"&gt;numpy.ndarray&lt;/code&gt;! This meant we had both the speed of Numpy and the details of gal_data_t in pyGnuastro’s &lt;code class="language-plaintext highlighter-rouge"&gt;pygnuastro.data&lt;/code&gt;. This was a major milestone in pyGnuastro.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img alt="pyGnuastro.data" src="https://labeeb-7z.github.io/Blogs/img/posts/final/python-type.jpg"&gt;&lt;/p&gt;
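
&lt;p&gt;As a rough pure-Python analogy of the idea (the real type is written in C with the C-Python API), the sketch below keeps the raw data and a few descriptive fields together in one object. The class name &lt;code class="language-plaintext highlighter-rouge"&gt;DataSketch&lt;/code&gt; and its fields are hypothetical, and a nested list stands in for the &lt;code class="language-plaintext highlighter-rouge"&gt;numpy.ndarray&lt;/code&gt; so that the example has no dependencies.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class DataSketch:
    """Hypothetical stand-in: one object holding pixels plus metadata."""
    array: list            # raw data (a numpy.ndarray in pyGnuastro)
    name: str = ""
    unit: str = ""
    comment: str = ""

    @property
    def shape(self):
        # Derived from the nested list; assumes a rectangular 2D dataset.
        return (len(self.array), len(self.array[0]))

    @property
    def size(self):
        rows, cols = self.shape
        return rows * cols


image = DataSketch(array=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                   name="demo", unit="counts")
print(image.name, image.unit, image.shape, image.size)
```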

&lt;h4 id="gpus-in-gnuastro"&gt;GPUs in Gnuastro&lt;/h4&gt;

&lt;p&gt;Gnuastro is an astronomical data analysis and manipulation library. Astronomical data is usually very large in size, and thus computationally intensive. If the operations performed on this data are parallelizable, then GPUs can significantly speed up the processing.&lt;/p&gt;

&lt;p&gt;I started my work on GPUs right after Mohammad approved my initial idea. Here’s a summary/story of all the work done for GPU support :&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Learning about build systems&lt;/strong&gt; : After the GPU support idea was accepted, my mentor suggested we should first set up the build system so CUDA modules can be integrated smoothly in the future. Gnuastro uses &lt;a href="https://en.wikipedia.org/wiki/GNU_Autotools"&gt;Autotools&lt;/a&gt; for its build system. I started by learning about &lt;a href="https://www.gnu.org/software/autoconf/"&gt;autoconf&lt;/a&gt;, &lt;a href="https://www.gnu.org/software/automake/"&gt;automake&lt;/a&gt; and &lt;a href="https://www.gnu.org/software/libtool/"&gt;libtool&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Linking Gnuastro with CUDA runtime&lt;/strong&gt; : The CUDA SDK provides a &lt;a href="https://nvidia.github.io/cuda-python/module/cudart.html"&gt;runtime library - &lt;code class="language-plaintext highlighter-rouge"&gt;cudart&lt;/code&gt;&lt;/a&gt; which is the necessary component to initiate communication with the GPU drivers. The runtime library is distributed as both a static library and a shared object file. This made things easier as we could link the runtime library statically with the Gnuastro library, making &lt;code class="language-plaintext highlighter-rouge"&gt;cudart&lt;/code&gt; part of Gnuastro. I modified the configure script to link the runtime library statically with Gnuastro. This was also the time I learnt extensively about how low-level system libraries are built, linked and distributed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Struggling with Libtool&lt;/strong&gt; : I then tried to implement some simple matrix functions in CUDA and integrate them with Gnuastro. CUDA source code is compiled by the &lt;code class="language-plaintext highlighter-rouge"&gt;nvcc&lt;/code&gt; compiler. However during linking, libtool assumes that all source files are compiled by &lt;code class="language-plaintext highlighter-rouge"&gt;gcc&lt;/code&gt;. It ignored all the CUDA source files. After writing dedicated rules for CUDA source compilation in the Makefile, the CUDA source was getting compiled, but not being linked into the Gnuastro library. Libtool only links files having a corresponding libtool object (.lo files), and they’re created by libtool for each source file handled by it (which in our case were the gcc compiled files).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;AutoMake developers rescuing us&lt;/strong&gt; : After trying and struggling with libtool for a few days, my mentor suggested that I contact the AutoMake developers to seek some help. I &lt;a href="https://lists.gnu.org/archive/html/automake/2023-03/msg00036.html"&gt;mailed&lt;/a&gt; them a &lt;a href="https://github.com/labeeb-7z/cuda-gnu/tree/main/shared-library"&gt;small demonstration&lt;/a&gt; of what I was trying to do and waited for their response. After a few days, I received a reply from them. The fix was actually simple: automake has a special variable (&lt;code class="language-plaintext highlighter-rouge"&gt;LD_ADD&lt;/code&gt;) which directly communicates with the GNU linker (ld), and I just had to add the CUDA object files to this variable. It worked and we finally had a working CUDA module in Gnuastro which used the GPU for execution!&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It was around the 1st week of April by now. I made my final proposal submission and had my fingers crossed for getting selected in GSoC.&lt;/p&gt;

&lt;p&gt;As mentioned in the GSoC proposal, we had to first focus on the Error handling and Python wrappers, so I started working on these two goals (I was also indeed selected for GSoC in the meantime!).&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Convolutions on GPU&lt;/strong&gt; : After getting back to working with GPUs around June-July, I started with implementing the convolution function in CUDA. Convolution is a direct operation as well as a subroutine to other operations in Gnuastro.
The results of CUDA convolution were remarkable. We got up to 400x speed-up on the convolution operation! My mentor then suggested that, since the speed-up is very significant, I should prioritise getting more of the GPU work done.
Read more about Convolution on GPU in my blog &lt;a href="https://labeeb-7z.github.io/Blogs/2023/07/03/GPUs-and-Convolution.html"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Adapting OpenCL&lt;/strong&gt; : CUDA is a proprietary framework by Nvidia. It only works on Nvidia GPUs. We wanted to make Gnuastro GPU support available to all users, irrespective of the GPU they have. This is where OpenCL comes in. OpenCL is an open standard for parallel programming of heterogeneous systems. It is supported by all major GPU vendors. I started learning about OpenCL and how it works at a low level. I also started learning about the OpenCL C99 programming standard. Read more about starting with OpenCL in my blog &lt;a href="https://labeeb-7z.github.io/Blogs/2023/07/28/Towards-OpenCL.html"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Integrating OpenCL&lt;/strong&gt; : OpenCL was initially hard to learn, but I managed to integrate it with Gnuastro right before my GSoC’s official timeline was about to end! I have a pretty detailed blog on the entire integration process &lt;a href="https://labeeb-7z.github.io/Blogs/2023/08/12/Integrating-OpenCL.html"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Same code on CPU and GPU&lt;/strong&gt; : After we had success with OpenCL, my mentor recommended we should try executing the &lt;code class="language-plaintext highlighter-rouge"&gt;exact&lt;/code&gt; same code on CPU and GPU, to show the concept of executing the same instructions on both processors and seeing the speed-up on GPUs. This was never done in the field of Astronomy so it’d have been a great demonstration. This was quite challenging as GPUs are programmed with different frameworks and have some extra components in code for management. In Machine Learning frameworks, the GPU and CPU modules are generally written separately (in fact Tensorflow used to have a different package altogether for GPU until 2.0).
However the good part is, most of the GPU frameworks are derived from the C/C++ languages. I spent my last week of GSoC trying to implement the core logic in a macro which will be shared by both the OpenCL kernels and the C library, and had success. This can be accessed here.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
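
&lt;p&gt;Convolution maps well to GPUs because every output pixel can be computed independently of the others. The sketch below is plain Python, not the CUDA or OpenCL code from this project. The function names are hypothetical, and treating pixels outside the image as zero is an assumption made only to keep the example short. &lt;code class="language-plaintext highlighter-rouge"&gt;convolve_pixel&lt;/code&gt; is roughly the work that one GPU thread would do.&lt;/p&gt;

```python
def convolve_pixel(image, kernel, row, col):
    """Value of one output pixel; this is the work of one GPU thread."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    half_r, half_c = k_rows // 2, k_cols // 2
    total = 0.0
    for i in range(k_rows):
        for j in range(k_cols):
            r = row + i - half_r
            c = col + j - half_c
            # Pixels outside the image are treated as zero (an assumption).
            if r in range(len(image)) and c in range(len(image[0])):
                # The kernel is flipped, as in a mathematical convolution.
                total += image[r][c] * kernel[k_rows - 1 - i][k_cols - 1 - j]
    return total


def convolve(image, kernel):
    """The CPU visits pixels one by one; a GPU computes many at once."""
    return [[convolve_pixel(image, kernel, r, c)
             for c in range(len(image[0]))]
            for r in range(len(image))]


# A single bright pixel in the middle reproduces the kernel around it.
image = [[0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0]]
kernel = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0],
          [7.0, 8.0, 9.0]]
result = convolve(image, kernel)
print(result)
```

&lt;p&gt;No call to &lt;code class="language-plaintext highlighter-rouge"&gt;convolve_pixel&lt;/code&gt; depends on the result of another one, which is what allows thousands of them to run at the same time on a GPU.&lt;/p&gt;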

&lt;p&gt;&lt;strong&gt;Future&lt;/strong&gt; : The future of this project is very bright. I have set up the bare-bone GPU integration already, and I’ll continue to add GPU modules building upon it. We have a working OpenCL integration. We have a working CUDA integration. We have working CPU-GPU code sharing. I mentioned certain challenges we are currently facing in my &lt;a href="https://labeeb-7z.github.io/Blogs/2023/08/12/Integrating-OpenCL.html"&gt;opencl_integration&lt;/a&gt; blog. I’ll continue to work on a solution for them and to add support for further modules on GPU.&lt;/p&gt;

&lt;h3 id="acknowledgements"&gt;Acknowledgements&lt;/h3&gt;

&lt;p&gt;GSoC has been a great learning experience for me. I’m extremely grateful to everyone who was part of this journey.&lt;/p&gt;

&lt;p&gt;I would like to thank my mentor &lt;a href="https://akhlaghi.org/"&gt;Mohammad Akhlaghi&lt;/a&gt; for his constant support and guidance throughout the project. He has been very patient right from the beginning, and believed in me when I did not have a clear idea on how I’d approach all the goals. He allowed me to work at my own pace, explore and learn things as needed, and has always pulled me out of the rabbit hole whenever I got stuck. Every time I join a meeting with him, I learn something new. I’m very grateful to him for giving me this opportunity to work on this project.&lt;/p&gt;

&lt;p&gt;I am sincerely thankful to Jash Shah for introducing me to the Gnuastro development team and walking me through the existing work on error handling and pyGnuastro. It gave me a huge boost and was extremely valuable. He’s always been attentive to my small queries and has supported me through multiple challenges. In general, I’m very grateful to have him as a mentor and friend.&lt;/p&gt;

&lt;p&gt;I would also like to thank the Gnuastro development team for their support and feedback throughout the project. It’s been such a wonderful time working with them. I have learnt a ton from Pedram’s work on adding SQL to Gnuastro, Fathma’s work on TIFF files and the Curl library, and Faezeh’s work on implementing Convolutional Neural Networks in Gnuastro.
They’ve always been crucial in providing feedback and suggestions on my work. I’m very grateful to them for their support.
I am genuinely grateful for the opportunity to collaborate with such a talented and committed group, and I look forward to working and growing with them in the future.&lt;/p&gt;

&lt;p&gt;I would also like to thank the Google Summer of Code team for taking the wonderful initiative and giving me this opportunity to work on this project.&lt;/p&gt;</description><category>gnuastro</category><guid>http://openastronomy.org/Universe_OA/posts/2023/08/20230822_0000_labeeb-7z/</guid><pubDate>Mon, 21 Aug 2023 23:00:00 GMT</pubDate></item><item><title>Integrating OpenCL with Gnuastro</title><link>http://openastronomy.org/Universe_OA/posts/2023/08/20230812_0000_labeeb-7z/</link><dc:creator>Labib Asari</dc:creator><description>&lt;h3 id="background"&gt;Background&lt;/h3&gt;

&lt;p&gt;In the last post, I discussed what OpenCL is and why we chose to integrate it with Gnuastro. In this post, I’ll be discussing the actual implementation and the challenges I faced.&lt;/p&gt;
&lt;!-- TEASER_END --&gt;

&lt;h3 id="programming-in-opencl"&gt;Programming in OpenCL&lt;/h3&gt;

&lt;p&gt;The OpenCL 3.0 standard has done a great job of simplifying the programming model. The OpenCL 3.0 API is a header-only library that provides a modern, object-oriented interface to the OpenCL runtime. It is designed to be easy to use and provides an abstraction of the OpenCL runtime, making it easier to write portable code across different OpenCL implementations. We still have to communicate with the driver (unlike CUDA) at a low level, but this becomes a mandatory step when we want to run our code on different hardware (CUDA always expects an NVIDIA device).&lt;/p&gt;

&lt;p&gt;Here’s a general overview of the steps to be followed when writing a program using OpenCL :&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Check for available Platforms&lt;/strong&gt; : A platform is a collection of OpenCL devices. A platform can be a CPU, GPU, or an FPGA (Remember OpenCL can work with any platform!). This is done specifically to identify which OpenCL implementation will be used during runtime. We can query the system for available platforms using the &lt;code class="language-plaintext highlighter-rouge"&gt;clGetPlatformIDs&lt;/code&gt; function. This function returns a list of platforms available on the system.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Check for available devices&lt;/strong&gt; : A device is a physical device that can execute OpenCL kernels. A device can be a CPU, GPU, or an FPGA. We can query the system for available devices using the &lt;code class="language-plaintext highlighter-rouge"&gt;clGetDeviceIDs&lt;/code&gt; function. This function returns a list of devices available on the system.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Create a context&lt;/strong&gt; : A context is a container for all the OpenCL objects. It is used to manage the memory, command queues, and other OpenCL objects. It is created by passing a list of devices to the constructor. Since OpenCL can work with multiple devices, we can create a context with multiple devices. This is useful when we want to run our code on multiple devices at the same time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Create a command queue&lt;/strong&gt; : A command queue is used to queue up commands for the device to execute. By default the device executes the commands in the order they are received. The commands can be kernel execution, memory transfer, or any other OpenCL command. We can also create multiple command queues. This is useful when we want to run multiple streams of commands. Enqueueing commands in OpenCL is asynchronous by default. This means that the commands are queued up and control is returned to the host, which can then continue with other tasks. When the host needs the results, it can wait for them: read and write commands can be made blocking, and &lt;code class="language-plaintext highlighter-rouge"&gt;clFinish&lt;/code&gt; returns only once all the commands in the queue have been executed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Load the Kernel&lt;/strong&gt; : A kernel is a function that is executed on the device. It is written in OpenCL C, which is based on the &lt;code class="language-plaintext highlighter-rouge"&gt;C99 standard&lt;/code&gt;. We can load the kernel from a file or we can write the kernel inline. To maintain portability, OpenCL kernels are generally compiled at runtime using &lt;code class="language-plaintext highlighter-rouge"&gt;clBuildProgram&lt;/code&gt;. We can also compile the kernel offline. This is useful when we want to compile the kernel for a specific device.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Copy Data to device memory&lt;/strong&gt; : All the data used in the kernel must be in device memory. So we have to copy the data from the host to the device memory. We can do this using the &lt;code class="language-plaintext highlighter-rouge"&gt;clCreateBuffer&lt;/code&gt; function. This function creates a buffer on the device memory. We can then copy the data from the host to the device using the &lt;code class="language-plaintext highlighter-rouge"&gt;clEnqueueWriteBuffer&lt;/code&gt; function. This function copies the data from the host to the device.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Launch the kernel&lt;/strong&gt; : We can launch the kernel by passing the kernel object to the command queue (with &lt;code class="language-plaintext highlighter-rouge"&gt;clEnqueueNDRangeKernel&lt;/code&gt;). We have to set the arguments for the kernel separately, using the &lt;code class="language-plaintext highlighter-rouge"&gt;clSetKernelArg&lt;/code&gt; function. We can also set the global and local work size. The global work size is the total number of work items that will be executed. The local work size is the number of work items that will be executed in a work group. The global work size should be a multiple of the local work size. If the number of elements to process is not a multiple of the local work size, the host code usually rounds the global work size up to the next multiple of the local work size, and the kernel ignores the extra work items.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Read the output&lt;/strong&gt; : We can read the output from the device using the &lt;code class="language-plaintext highlighter-rouge"&gt;clEnqueueReadBuffer&lt;/code&gt; function. This function copies the data from the device to the host.&lt;/li&gt;
&lt;/ul&gt;
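
&lt;p&gt;The relation between the global and local work size in the launch step can be sketched in a few lines of Python. This is only an illustration: &lt;code class="language-plaintext highlighter-rouge"&gt;round_up_global_size&lt;/code&gt; and &lt;code class="language-plaintext highlighter-rouge"&gt;simulated_launch&lt;/code&gt; are hypothetical helper names, and the loop merely imitates a one-dimensional kernel launch on the CPU, with a guard so that the padded work items do nothing.&lt;/p&gt;

```python
def round_up_global_size(n_items, local_size):
    """Smallest multiple of local_size that covers n_items work items."""
    groups = (n_items + local_size - 1) // local_size
    return groups * local_size


def simulated_launch(data, local_size):
    """Mimic a 1D kernel launch: one loop iteration per work item."""
    global_size = round_up_global_size(len(data), local_size)
    out = [None] * len(data)
    for gid in range(global_size):
        # Guard: padded work items beyond the data do nothing.
        if gid not in range(len(data)):
            continue
        out[gid] = 2 * data[gid]
    return global_size, out


print(round_up_global_size(1000, 64))
print(simulated_launch([1, 2, 3, 4, 5], 4))
```

&lt;p&gt;For 1000 elements and a local size of 64, the global size becomes 1024, so 24 work items fall outside the data and must be skipped inside the kernel.&lt;/p&gt;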

&lt;h3 id="implementation"&gt;Implementation&lt;/h3&gt;

&lt;p&gt;Among all the steps mentioned above, everything up till loading the kernel is common to all the programs that’ll be using OpenCL. So we defined a &lt;code class="language-plaintext highlighter-rouge"&gt;gpu_utils&lt;/code&gt; module which is responsible for querying for the available platforms and devices, creating the context and command queue, loading and compiling the kernel. The only external data it requires is the path to the kernel file. This is provided as an input.
It also provides utility functions to copy specific data types to and from device memory.&lt;/p&gt;

&lt;p&gt;There’ll be 2 types of OpenCL programs in Gnuastro :&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Programs using OpenCL to speed-up existing operations inside Gnuastro.&lt;/li&gt;
&lt;li&gt;User defined OpenCL kernels, responsible for performing a custom task.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4 id="programs-using-opencl-to-speed-up-existing-operations-inside-gnuastro"&gt;Programs using OpenCL to speed-up existing operations inside Gnuastro&lt;/h4&gt;

&lt;p&gt;These programs will be using OpenCL to speed up existing operations inside Gnuastro. For example, we can use OpenCL to speed up the &lt;code class="language-plaintext highlighter-rouge"&gt;astconvolve&lt;/code&gt; operation by passing an extra &lt;code class="language-plaintext highlighter-rouge"&gt;--gpu&lt;/code&gt; option. For these programs, the OpenCL kernels will be part of the Gnuastro Library.&lt;/p&gt;

&lt;p&gt;The general flow of the program then becomes :&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The user passes the input data for a specific operation, and also chooses the local and global work size.&lt;/li&gt;
&lt;li&gt;The program then initializes the device using &lt;code class="language-plaintext highlighter-rouge"&gt;gpu_utils&lt;/code&gt; module by providing the kernel file from the library, which does everything and returns a &lt;code class="language-plaintext highlighter-rouge"&gt;cl_kernel&lt;/code&gt; (which is essentially the compiled kernel).&lt;/li&gt;
&lt;li&gt;Data transfer from CPU to device (GPU) is done using the functions provided by &lt;code class="language-plaintext highlighter-rouge"&gt;gpu_utils&lt;/code&gt; module.&lt;/li&gt;
&lt;li&gt;The kernel is launched with the provided global and local work size.&lt;/li&gt;
&lt;li&gt;Data is copied back to CPU memory and returned to the user.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="user-defined-opencl-kernels-responsible-for-performing-a-custom-task"&gt;User defined OpenCL kernels, responsible for performing a custom task&lt;/h4&gt;

&lt;p&gt;These programs will be using OpenCL to perform a custom task. For example, we can use OpenCL to perform a custom convolution operation by passing a custom kernel. For these programs, the OpenCL kernels will be provided by the user. The exact design details are yet to be determined for this.&lt;/p&gt;

&lt;h3 id="results"&gt;Results&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Input image : a 10,000 x 20,000 random image with a normal distribution.&lt;/li&gt;
&lt;li&gt;Kernel : a 7 x 7 standard convolution kernel.&lt;/li&gt;
&lt;li&gt;CPU : Intel(R) Core(TM) i5-9300HF CPU @ 2.40GHz&lt;/li&gt;
&lt;li&gt;GPU : NVIDIA GeForce GTX 1650&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Convolution using existing convolution in Gnuastro :&lt;/p&gt;

&lt;p&gt;&lt;img alt="Convolution using existing convolution in Gnuastro" src="https://labeeb-7z.github.io/Blogs/img/posts/opencl-imp/conv_cpu.png"&gt;&lt;/p&gt;

&lt;p&gt;Convolution on OpenCL :&lt;/p&gt;

&lt;p&gt;&lt;img alt="Convolution on OpenCL" src="https://labeeb-7z.github.io/Blogs/img/posts/opencl-imp/conv_gpu.png"&gt;&lt;/p&gt;

&lt;p&gt;Result&lt;/p&gt;

&lt;p&gt;&lt;img alt="Result" src="https://labeeb-7z.github.io/Blogs/img/posts/opencl-imp/res.png"&gt;&lt;/p&gt;

&lt;p&gt;The speed-up for the convolution operation itself ranges from 300-500x, but for the entire operation it’s around 3-5x due to the overhead of copying data to and from the device. Overcoming this is a big and important challenge!&lt;/p&gt;
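
&lt;p&gt;The gap between the two numbers follows from simple arithmetic, sketched below in Python. The timings are hypothetical round values chosen for illustration, not the measurements shown in the images above, and the function names are made up for this example.&lt;/p&gt;

```python
def end_to_end_speedup(cpu_time, kernel_time, transfer_time):
    """Speed-up of the whole operation, including host-device copies."""
    return cpu_time / (kernel_time + transfer_time)


def transfer_fraction(kernel_time, transfer_time):
    """Fraction of the GPU run that is spent on copying data."""
    return transfer_time / (kernel_time + transfer_time)


# Hypothetical timings in seconds, chosen only for illustration.
cpu_time = 40.0
kernel_time = 0.1       # the kernel alone is about 400x faster than the CPU
transfer_time = 9.9     # copying the data to the device and back

print(cpu_time / kernel_time)
print(end_to_end_speedup(cpu_time, kernel_time, transfer_time))
print(transfer_fraction(kernel_time, transfer_time))
```

&lt;p&gt;With these made-up numbers the kernel alone is about 400 times faster than the CPU, but the whole operation is only about 4 times faster, because nearly all of the GPU run is spent moving data.&lt;/p&gt;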

&lt;h3 id="challenges"&gt;Challenges&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No &lt;code class="language-plaintext highlighter-rouge"&gt;GAL_DATA_T&lt;/code&gt; inside OpenCL kernel!&lt;/strong&gt; : Inside OpenCL, &lt;code class="language-plaintext highlighter-rouge"&gt;cl_mem&lt;/code&gt; is the primary object used to represent memory objects such as buffers and images. It is used to allocate memory on the device. Regardless of the form the data has on the host (arrays, structs, etc), it’s all converted into a &lt;code class="language-plaintext highlighter-rouge"&gt;cl_mem&lt;/code&gt; object when copied to the device.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However inside Gnuastro, the core data structure is &lt;code class="language-plaintext highlighter-rouge"&gt;gal_data_t&lt;/code&gt; which is essentially just a C struct.&lt;/p&gt;

&lt;p&gt;Why is this a problem? Well the raw data of the input image/table is not contained inside the &lt;code class="language-plaintext highlighter-rouge"&gt;gal_data_t&lt;/code&gt;. It merely contains a pointer to that data! So when we copy the &lt;code class="language-plaintext highlighter-rouge"&gt;gal_data_t&lt;/code&gt; to the device, the raw data (which is huge) is not copied. (It lives in CPU memory, and CPU pointers can’t be used on GPU memory.)&lt;/p&gt;

&lt;p&gt;What about copying the raw data separately to the GPU memory, and then replacing the pointer inside &lt;code class="language-plaintext highlighter-rouge"&gt;gal_data_t&lt;/code&gt; with a pointer which has the address in GPU memory? Well, this is not possible either. Why? See, when we are on the CPU, we have a &lt;code class="language-plaintext highlighter-rouge"&gt;gal_data_t&lt;/code&gt; struct which is a single big object with ‘sub-objects’ (one of which is the pointer). But on the GPU, we have a &lt;code class="language-plaintext highlighter-rouge"&gt;cl_mem&lt;/code&gt; which is an object, but unlike structs, it can’t have sub-objects!&lt;/p&gt;

&lt;p&gt;How do we solve this? Currently all the required pointers inside &lt;code class="language-plaintext highlighter-rouge"&gt;gal_data_t&lt;/code&gt; are passed as separate arguments to the kernel. After a careful study of the internal implementation of the &lt;code class="language-plaintext highlighter-rouge"&gt;cl_mem&lt;/code&gt; object, we’ll see if we can directly pass the &lt;code class="language-plaintext highlighter-rouge"&gt;gal_data_t&lt;/code&gt; to the kernel.&lt;/p&gt;
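
&lt;p&gt;The pointer problem can be reproduced on the CPU alone with Python’s &lt;code class="language-plaintext highlighter-rouge"&gt;ctypes&lt;/code&gt; module. In the sketch below, &lt;code class="language-plaintext highlighter-rouge"&gt;DataSketch&lt;/code&gt; is a hypothetical two-member stand-in for &lt;code class="language-plaintext highlighter-rouge"&gt;gal_data_t&lt;/code&gt; (the real struct has many more members), and &lt;code class="language-plaintext highlighter-rouge"&gt;kernel_arguments&lt;/code&gt; is a made-up helper that imitates the current workaround of handing each member over separately.&lt;/p&gt;

```python
import ctypes


class DataSketch(ctypes.Structure):
    """Hypothetical, tiny stand-in for a struct that points to its data."""
    _fields_ = [("size", ctypes.c_size_t),
                ("array", ctypes.POINTER(ctypes.c_float))]


n_pixels = 1000
pixels = (ctypes.c_float * n_pixels)(*range(n_pixels))
data = DataSketch(n_pixels, ctypes.cast(pixels, ctypes.POINTER(ctypes.c_float)))

# Copying the struct byte-for-byte copies only the size and the address,
# not the pixel values that the address refers to.
struct_bytes = bytes(data)
print(len(struct_bytes), ctypes.sizeof(pixels))


def kernel_arguments(d):
    """Workaround from the text: pass each member as a separate argument."""
    raw = ctypes.string_at(d.array, d.size * ctypes.sizeof(ctypes.c_float))
    return d.size, raw


size, raw = kernel_arguments(data)
print(size, len(raw))
```

&lt;p&gt;The struct itself is only a handful of bytes, while the pixel buffer it points to is 4000 bytes here. Only the second copy, made by following the pointer explicitly, contains the data a kernel would need.&lt;/p&gt;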

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Transfer Overhead&lt;/strong&gt; : As mentioned multiple times, for using GPUs, we must copy data to and from the GPU memory. Astronomical datasets are huge, and copying them for each operation is a big overhead! In fact the actual operation is much, much faster than the data transfer, so much so that around 95% of the time is spent in copying data to and from the GPU memory. It reduces performance by ~100x! It can’t continue this way!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One solution we’ve come up with is to load the external data directly into GPU memory, instead of CPU memory, when it is loaded for the first time in the program. This way, for each subsequent operation, we don’t have to copy the data from CPU to GPU memory. After all the operations are done, we’ll copy the result back to CPU memory and save it to the disk. This will avoid almost all the data transfer overhead.&lt;/p&gt;
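
&lt;p&gt;A small Python sketch shows why keeping the data on the GPU helps more as more operations are chained. The function names and timings are hypothetical; the transfer time was picked so that copying takes 95% of a single operation, and it is assumed that every operation costs the same.&lt;/p&gt;

```python
def total_time_copy_every_op(n_ops, kernel_time, transfer_time):
    """Each operation copies its input to the device and its output back."""
    return n_ops * (kernel_time + transfer_time)


def total_time_data_resident(n_ops, kernel_time, transfer_time):
    """Data is copied to the device once and the result copied back once."""
    return transfer_time + n_ops * kernel_time


# Hypothetical timings in seconds, chosen only for illustration.
kernel_time = 0.5
transfer_time = 9.5     # one round trip: host to device and back
for n_ops in (1, 5, 20):
    naive = total_time_copy_every_op(n_ops, kernel_time, transfer_time)
    resident = total_time_data_resident(n_ops, kernel_time, transfer_time)
    print(n_ops, naive, resident, naive / resident)
```

&lt;p&gt;For a single operation nothing is gained, but with 20 chained operations the made-up numbers give 200 seconds against 19.5 seconds.&lt;/p&gt;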

&lt;p&gt;This is about the same approach used by Machine Learning Libraries such as Tensorflow. Basically during initialization, it occupies all the GPU memory it can, and keeps it occupied. All the operations, their results and the subsequent operations are done on the GPU memory itself.&lt;/p&gt;</description><category>gnuastro</category><guid>http://openastronomy.org/Universe_OA/posts/2023/08/20230812_0000_labeeb-7z/</guid><pubDate>Fri, 11 Aug 2023 23:00:00 GMT</pubDate></item><item><title>Moving towards OpenCL</title><link>http://openastronomy.org/Universe_OA/posts/2023/07/20230728_0000_labeeb-7z/</link><dc:creator>Labib Asari</dc:creator><description>&lt;h3 id="background"&gt;Background&lt;/h3&gt;

&lt;p&gt;So far, all my work on GPUs has been using CUDA. But CUDA is proprietary to NVIDIA and only works on NVIDIA GPUs. So, I’ve been working on moving the code to OpenCL, which is an open standard for parallel programming on heterogeneous systems.&lt;/p&gt;
&lt;!-- TEASER_END --&gt;

&lt;h3 id="opencl"&gt;OpenCL&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.khronos.org/opencl/"&gt;OpenCL&lt;/a&gt; (Open Computing Language) is an open standard for cross-platform, parallel programming of diverse accelerators (CPUs, GPUs, FPGAs, etc) found in supercomputers, cloud servers, personal computers, mobile devices and embedded platforms.
Note the 2 key points :&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code class="language-plaintext highlighter-rouge"&gt;open standard&lt;/code&gt; : this means that the specification and documentation of the technology are publicly available and can be accessed by anyone.&lt;/li&gt;
&lt;li&gt;&lt;code class="language-plaintext highlighter-rouge"&gt;cross-platform&lt;/code&gt; : this means that it can run on multiple operating systems and hardware architectures without requiring major modifications to the code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes OpenCL a very attractive option for developers who want to write code that can run on a wide range of devices. From Gnuastro’s perspective, this means that we can write code that can run on GPUs from multiple manufacturers, as well as CPUs and other accelerators. Our GPU kernels will be portable to any system, regardless of its configuration!&lt;/p&gt;

&lt;p&gt;Next point to consider is OpenCL is a &lt;code class="language-plaintext highlighter-rouge"&gt;standard&lt;/code&gt;. It is different from CUDA in this regard. CUDA is a framework, whereas OpenCL is a standard. What does this mean?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The OpenCL standard refers to the specification and guidelines set forth by the Khronos Group which is responsible for developing and maintaining the standard. The OpenCL standard defines the API, data types, functions, and programming model that developers must follow when writing code for OpenCL. It is a formal document that ensures uniformity and compatibility across different OpenCL implementations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OpenCL is not an open-source library! It basically defines how the library should behave (big simplification!).&lt;/p&gt;

&lt;p&gt;So what can we do with the standard alone? Not much! We need an &lt;code class="language-plaintext highlighter-rouge"&gt;implementation of the standard&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This also reminds me of a question I once had: what do you need to create a new programming language?
My first guess was a compiler! My thought process was that if a program (a compiler in this case) can understand my high-level language and convert it to the corresponding machine code, then I can write programs in that high-level language for any task!
So all I’d need is a compiler for that language.
It’s close, but not totally accurate.&lt;/p&gt;

&lt;p&gt;You don’t actually need a compiler for a new programming language. You ONLY need a &lt;code class="language-plaintext highlighter-rouge"&gt;specification&lt;/code&gt; for it. The specification defines the syntax and semantics (rules) of the language.
You only need a compiler when you want to run programs written in your language! (What good is a language if you can’t run programs using it? haha)&lt;/p&gt;

&lt;p&gt;Similarly, OpenCL defines a set of rules which specify how it will behave. But to use OpenCL, we need an implementation of this standard.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;OpenCL implementations are software packages developed by hardware manufacturers that provide the necessary drivers and runtime libraries for running OpenCL applications on their specific hardware. Each hardware vendor is responsible for creating its own OpenCL implementation that conforms to the OpenCL standard. This means that each implementation may have its own unique features and quirks, but they all adhere to the same standard.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many different implementations available for it! (find the full list &lt;a href="https://www.khronos.org/conformance/adopters/conformant-products/opencl"&gt;here&lt;/a&gt; or &lt;a href="https://www.iwocl.org/resources/opencl-implementations/"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Basically, each hardware manufacturer provides an implementation of the OpenCL standard for its hardware. This implementation is usually provided as a framework. Depending on what hardware you have on your system, you can choose the corresponding framework to use.&lt;/p&gt;

&lt;h3 id="how-does-opencl-work"&gt;How does OpenCL work?&lt;/h3&gt;

&lt;p&gt;Here’s what a typical OpenCL system looks like:&lt;/p&gt;

&lt;p&gt;&lt;img alt="opencl-sytem" src="https://labeeb-7z.github.io/Blogs/img/posts/opencl/opencl-system.png"&gt;&lt;/p&gt;

&lt;p&gt;OpenCL programs consist of two parts: host code and device code. The host code is written in C or C++ and runs on the host, while the device code is written in OpenCL C and runs on the device. The host code is responsible for setting up the OpenCL environment, creating the context, compiling the device code, and executing the kernels on the device.&lt;/p&gt;

&lt;p&gt;The device code is compiled at runtime by the host code. This means that the host code must be compiled first, and then the device code can be compiled. The host code is compiled using a standard C/C++ compiler, while the device code is compiled using the OpenCL compiler. The OpenCL compiler is provided by the OpenCL implementation and is responsible for compiling the device code into binary code that can be executed on the device.&lt;/p&gt;

&lt;p&gt;How does the OpenCL library interact with the hardware? It’s made possible through the OpenCL ICD.&lt;/p&gt;

&lt;p&gt;OpenCL ICD stands for OpenCL Installable Client Driver. It is a component of OpenCL that enables OpenCL drivers from multiple manufacturers to coexist on a single system. Instead of having a single monolithic OpenCL driver, an ICD allows different manufacturers (e.g., NVIDIA, AMD, Intel) to provide their own separate OpenCL implementations as dynamically loadable libraries. This means that developers can select the appropriate OpenCL driver at runtime without needing to modify their applications.&lt;/p&gt;

&lt;p&gt;The ICD mechanism is crucial for achieving portability and flexibility when developing applications that use the computational power of devices from different manufacturers.&lt;/p&gt;
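
&lt;p&gt;To make the ICD idea a little more concrete, here is a minimal sketch in Python. It assumes the Linux convention used by the Khronos ICD loader, where every installed implementation registers itself with a small text file in /etc/OpenCL/vendors whose content is the name of the vendor library. The function name list_icd_libraries is made up for this illustration; it is not part of OpenCL.&lt;/p&gt;

```python
from pathlib import Path


def list_icd_libraries(vendors_dir="/etc/OpenCL/vendors"):
    """Return (icd_file_name, library_name) pairs registered in vendors_dir.

    Assumption: the Linux convention of the Khronos ICD loader, where each
    *.icd file is a small text file naming one vendor library.
    """
    directory = Path(vendors_dir)
    if not directory.is_dir():
        return []  # nothing registered, or not a Linux-style layout
    entries = []
    for icd_file in sorted(directory.glob("*.icd")):
        library = icd_file.read_text().strip()
        entries.append((icd_file.name, library))
    return entries


if __name__ == "__main__":
    for name, library in list_icd_libraries():
        print(name, "provides", library)
```

&lt;p&gt;On a machine with several vendors installed, each entry corresponds to one implementation that the loader can hand to an application at runtime. On other operating systems the registration mechanism differs, so the sketch simply returns an empty list there.&lt;/p&gt;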

&lt;h3 id="opencl-programming-model"&gt;OpenCL Programming Model&lt;/h3&gt;

&lt;p&gt;The programming model of OpenCL is very similar to that of CUDA, which I covered in my previous post. However, CUDA provides more abstraction, since it has its own runtime library which communicates with the driver.
In OpenCL there’s direct communication with the drivers and the host code is responsible for setting up the environment, so it’s a bit lower level than CUDA.&lt;/p&gt;

&lt;p&gt;Some of the key terms in OpenCL are :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Work Item: Basic unit of work on a compute device&lt;/li&gt;
&lt;li&gt;Kernel: The code that runs on a work item (Basically a C function)&lt;/li&gt;
&lt;li&gt;Program: Collection of kernels and other functions&lt;/li&gt;
&lt;li&gt;Context: The environment where work-items execute (Devices, their memories and command queues)&lt;/li&gt;
&lt;li&gt;Command Queue: Queue used by the host to submit work (kernels, memory copies) to the device.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’ll cover the programming aspect of OpenCL in more detail in my next post.&lt;/p&gt;</description><category>gnuastro</category><guid>http://openastronomy.org/Universe_OA/posts/2023/07/20230728_0000_labeeb-7z/</guid><pubDate>Thu, 27 Jul 2023 23:00:00 GMT</pubDate></item><item><title>GPUs and Convolutions in Gnuastro</title><link>http://openastronomy.org/Universe_OA/posts/2023/07/20230704_0000_labeeb-7z/</link><dc:creator>Labib Asari</dc:creator><description>&lt;h3 id="background"&gt;Background&lt;/h3&gt;

&lt;p&gt;This is an overview of what I’ve been up to for the past 2 weeks. It doesn’t go into much technical detail or the actual code, but just walks through the general idea.&lt;/p&gt;
&lt;!-- TEASER_END --&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Convolution"&gt;Convolution&lt;/a&gt; is a fundamental operation in various domains, such as image processing, signal processing, and deep learning. It is an important module in Gnuastro and is also used as a subroutine in other modules.&lt;/p&gt;

&lt;p&gt;Convolutional operations can be broken down into smaller tasks, such as applying the kernel to different portions of the input data. By utilizing multiple threads, each thread can independently process a subset of the input, reducing the overall execution time. This parallelization technique is particularly effective when dealing with large input tensors or performing multiple convolutions simultaneously.&lt;/p&gt;
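
&lt;p&gt;The decomposition described above can be sketched in plain Python. This is only an illustration of splitting the work by rows, not Gnuastro’s implementation: the function names are invented here, pixels outside the image are assumed to be zero, and Python threads will not actually speed up this pure-Python loop. The point is just that each chunk of rows can be computed independently.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor


def convolve_rows(image, kernel, row_start, row_stop):
    """Convolve rows [row_start, row_stop); pixels outside the image count as 0."""
    height, width = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    off_y, off_x = k_h // 2, k_w // 2
    out = []
    for y in range(row_start, row_stop):
        out_row = []
        for x in range(width):
            total = 0.0
            for ky in range(k_h):
                for kx in range(k_w):
                    # True convolution flips the kernel, hence the subtraction.
                    sy, sx = y + off_y - ky, x + off_x - kx
                    if sy in range(height) and sx in range(width):
                        total += image[sy][sx] * kernel[ky][kx]
            out_row.append(total)
        out.append(out_row)
    return out


def convolve_parallel(image, kernel, n_workers=4):
    """Split the (non-empty) image into row chunks and convolve each in a worker."""
    height = len(image)
    chunk = -(-height // n_workers)  # ceiling division
    bounds = [(s, min(s + chunk, height)) for s in range(0, height, chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(lambda b: convolve_rows(image, kernel, b[0], b[1]), bounds)
        # map() keeps the chunks in order, so rows are stitched back correctly.
        return [row for part in parts for row in part]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(convolve_parallel(image, identity, n_workers=2))
```

&lt;p&gt;No chunk ever writes to another chunk’s output rows, which is exactly the property that makes convolution a good fit for many threads.&lt;/p&gt;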

&lt;p&gt;While traditional CPUs (Central Processing Units) excel at performing a wide range of tasks, they are not specifically designed for heavy parallel computations like convolutions. On the other hand, GPUs (Graphics Processing Units) are highly optimized for parallel processing, making them ideal for accelerating convolutional operations.&lt;/p&gt;

&lt;h3 id="gpus-vs-cpus-architecture"&gt;GPUs vs CPUs Architecture&lt;/h3&gt;
&lt;p&gt;&lt;img alt="Architecture difference" src="https://labeeb-7z.github.io/Blogs/img/posts/gpus/architecture.png"&gt;&lt;/p&gt;

&lt;h4 id="cores-and-parallelism-"&gt;Cores and Parallelism :&lt;/h4&gt;
&lt;p&gt;CPUs have fewer, more powerful cores optimized for sequential processing, while GPUs have thousands of smaller cores designed for parallel processing. This parallelism allows GPUs to perform computations on multiple data elements simultaneously, leading to significant speedup in parallelizable tasks like graphics rendering and deep learning.&lt;/p&gt;

&lt;h4 id="memory-hierarchy-"&gt;Memory Hierarchy :&lt;/h4&gt;
&lt;p&gt;CPUs typically have larger caches and more advanced memory management units (MMUs), focusing on low-latency operations and complex branch prediction. GPUs prioritize high memory bandwidth and utilize smaller caches to efficiently handle large amounts of data simultaneously, which is crucial for tasks like image processing and scientific simulations.&lt;/p&gt;

&lt;h4 id="emphasis-"&gt;Emphasis :&lt;/h4&gt;
&lt;p&gt;CPUs are designed with an emphasis on executing a single thread very fast. GPUs are designed with an emphasis on executing many threads at the same time.&lt;/p&gt;

&lt;h3 id="programming-model"&gt;Programming Model&lt;/h3&gt;
&lt;p&gt;For programming GPUs, several frameworks (high-level APIs) are available:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CUDA - developed by NVIDIA for its GPUs.&lt;/li&gt;
&lt;li&gt;OpenCL - an open, cross-platform parallel programming standard for diverse accelerators.&lt;/li&gt;
&lt;li&gt;HIP - developed by AMD, portable.&lt;/li&gt;
&lt;li&gt;and many more….&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id="cuda"&gt;CUDA&lt;/h3&gt;

&lt;h4 id="the-cuda-platform-consists-of-a-programming-language-a-compiler-and-a-runtime-library"&gt;The CUDA platform consists of a programming language, a compiler, and a runtime library.&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code class="language-plaintext highlighter-rouge"&gt;Programming Language&lt;/code&gt; - Based on C, has extensions to write code for GPU.&lt;/li&gt;
&lt;li&gt;&lt;code class="language-plaintext highlighter-rouge"&gt;Compiler&lt;/code&gt; - nvcc, which is built on LLVM; it hands host code to the system compiler and translates device code into binary code that can be executed on the GPU.&lt;/li&gt;
&lt;li&gt;&lt;code class="language-plaintext highlighter-rouge"&gt;Runtime Library&lt;/code&gt; - Provides the necessary functions and tools to manage the execution of the code on the GPU (interacts with the driver).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note: When a system has devices (GPUs, FPGAs, etc.) that can execute tasks apart from the main CPU, each of them is generally referred to as a &lt;code class="language-plaintext highlighter-rouge"&gt;device&lt;/code&gt;, whereas the main CPU is referred to as the &lt;code class="language-plaintext highlighter-rouge"&gt;host&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id="cuda-programs"&gt;CUDA Programs&lt;/h3&gt;

&lt;p&gt;CUDA programs consist of normal host code along with some &lt;code class="language-plaintext highlighter-rouge"&gt;kernels&lt;/code&gt;.
Kernels are like other functions, but when you call a kernel, it is executed N times in parallel by N different CUDA threads, as opposed to only once like a normal function. Kernels are defined using the &lt;code class="language-plaintext highlighter-rouge"&gt;__global__&lt;/code&gt; keyword.&lt;/p&gt;

&lt;p&gt;Eg :
&lt;img alt="kernel example" src="https://labeeb-7z.github.io/Blogs/img/posts/gpus/kernel.png"&gt;&lt;/p&gt;

&lt;p&gt;Normally, we put the above piece of code inside a loop, so all elements are covered.&lt;/p&gt;

&lt;p&gt;With GPUs, there’s no need for loops: for N elements, we launch N threads, each of which adds 1 element at the same time!&lt;/p&gt;
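
&lt;p&gt;The same idea can be mimicked in Python, assuming the kernel adds two arrays element by element. The names add_kernel and launch are hypothetical stand-ins, not CUDA APIs: the kernel body handles exactly one element, and the launch helper calls it once per thread id. On a GPU those calls would run concurrently; here they simply run one after another.&lt;/p&gt;

```python
def add_kernel(thread_id, a, b, out):
    """Body that one 'thread' runs: handle exactly one element."""
    out[thread_id] = a[thread_id] + b[thread_id]


def launch(kernel, n_threads, *args):
    """Stand-in for a kernel launch: run the body once per thread id.

    On a GPU these calls would execute concurrently; here they run in turn.
    """
    for thread_id in range(n_threads):
        kernel(thread_id, *args)


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(a)
launch(add_kernel, len(a), a, b, out)
print(out)  # [11.0, 22.0, 33.0, 44.0]
```

&lt;p&gt;Notice that the loop lives in the launch helper, not in the kernel. That is the part the GPU hardware replaces with real parallel threads.&lt;/p&gt;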

&lt;h3 id="cuda-execution-configuration"&gt;CUDA Execution Configuration&lt;/h3&gt;

&lt;p&gt;Can we launch an arbitrarily large number of threads?
Technically, no.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The maximum number of threads depends on your GPU’s compute capability.&lt;/li&gt;
&lt;li&gt;But in practice it’s so large that it covers all your elements.&lt;/li&gt;
&lt;li&gt;For compute capability 3.0 and above
&lt;ul&gt;
&lt;li&gt;Maximum number of threads: (2^31 - 1) × 65535 × 65535 blocks × 1024 threads per block, which is roughly 2^73.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
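
&lt;p&gt;As a rough check of that limit, here is the arithmetic in Python. It takes the documented limits for compute capability 3.0 and above as its inputs: a grid of up to 2^31 - 1 by 65535 by 65535 blocks, and up to 1024 threads per block. The constant names are mine, chosen only for readability.&lt;/p&gt;

```python
import math

# Limits for compute capability 3.0 and above (per the CUDA programming guide).
MAX_GRID_X = 2**31 - 1
MAX_GRID_Y = 65535
MAX_GRID_Z = 65535
MAX_THREADS_PER_BLOCK = 1024

max_blocks = MAX_GRID_X * MAX_GRID_Y * MAX_GRID_Z
max_threads = max_blocks * MAX_THREADS_PER_BLOCK

print(max_threads)
print(math.log2(max_threads))  # just under 73
```

&lt;p&gt;For comparison, a 10k x 20k image has 2 x 10^8 pixels, which is many orders of magnitude below this ceiling.&lt;/p&gt;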

&lt;h4 id="threads-and-blocks-"&gt;Threads and Blocks :&lt;/h4&gt;

&lt;p&gt;&lt;img alt="Threads and Blocks" src="https://labeeb-7z.github.io/Blogs/img/posts/gpus/config.png"&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All threads are organized into groups called blocks.&lt;/li&gt;
&lt;li&gt;All blocks are organized into a group called the grid.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Blocks and grids can be 1D, 2D or 3D structures.&lt;/p&gt;

&lt;p&gt;When calling a GPU kernel, we specify the structure of each block, number of blocks, and number of threads/block - This is called the Execution Configuration.&lt;/p&gt;

&lt;p&gt;Example :
&lt;img alt="Launching a kernel example" src="https://labeeb-7z.github.io/Blogs/img/posts/gpus/launch-kernel.png"&gt;&lt;/p&gt;

&lt;p&gt;The above code launches:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;32 × 32 × 1 = 1024 blocks,&lt;/li&gt;
&lt;li&gt;each having 16 × 16 = 256 threads,&lt;/li&gt;
&lt;li&gt;for a total of 1024 × 256 = 262,144 threads.&lt;/li&gt;
&lt;/ul&gt;
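
&lt;p&gt;Here is a small Python sketch of how such an execution configuration is usually derived from the data size. It assumes a 2D image and 16 by 16 blocks; the 512 by 512 image size and the helper names are my own choices for illustration, picked so that the result matches the numbers above.&lt;/p&gt;

```python
def blocks_needed(n_elements, threads_per_block):
    """Ceiling division: enough blocks to cover every element."""
    return (n_elements + threads_per_block - 1) // threads_per_block


def execution_config(width, height, block=(16, 16)):
    """Return the grid shape and total thread count for a 2D image."""
    grid = (blocks_needed(width, block[0]), blocks_needed(height, block[1]), 1)
    total_threads = grid[0] * grid[1] * grid[2] * block[0] * block[1]
    return grid, total_threads


def global_index(block_idx, block_dim, thread_idx):
    """Position along one axis of the element a given thread handles."""
    return block_idx * block_dim + thread_idx


print(execution_config(512, 512))  # ((32, 32, 1), 262144)
print(execution_config(500, 500))  # ((32, 32, 1), 262144)
print(global_index(31, 16, 15))    # 511
```

&lt;p&gt;The second call shows why kernels normally start with a bounds check: a 500 by 500 image still needs 32 blocks per axis, so some threads land outside the image and must do nothing.&lt;/p&gt;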

&lt;h3 id="cuda-memory-hierarchy"&gt;CUDA Memory Hierarchy&lt;/h3&gt;

&lt;p&gt;&lt;img alt="Memory Hierarchy" src="https://labeeb-7z.github.io/Blogs/img/posts/gpus/memory.png"&gt;
CUDA threads may access data from multiple memory spaces during their execution as illustrated above.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Local memory for each thread.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Shared memory between all threads of the same block.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Global memory between all blocks.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id="cuda-hardware-abstraction"&gt;CUDA Hardware abstraction&lt;/h3&gt;
&lt;p&gt;&lt;img alt="Hardware Abstraction" src="https://labeeb-7z.github.io/Blogs/img/posts/gpus/hardware.png"&gt;&lt;/p&gt;

&lt;p&gt;The entire GPU is divided into several Streaming Multiprocessors (SMs). They have a different architecture from a typical CPU core. Each SM has several CUDA cores, which are the actual processing units.&lt;/p&gt;

&lt;p&gt;The GPU is designed with the SIMT/SIMD philosophy, which allows many threads to execute concurrently. Each block runs on a single SM and is never split across SMs, although one SM can host several blocks at once.&lt;/p&gt;

&lt;h3 id="cuda-developing-workflow"&gt;CUDA Developing Workflow&lt;/h3&gt;
&lt;p&gt;&lt;img alt="Workflow" src="https://labeeb-7z.github.io/Blogs/img/posts/gpus/workflow.png"&gt;&lt;/p&gt;

&lt;h3 id="results-of-convolution-on-gpu-for-gnuastro"&gt;Results of Convolution on GPU for Gnuastro&lt;/h3&gt;

&lt;p&gt;All tests were performed on a system with the following specifications:&lt;/p&gt;

&lt;p&gt;CPU :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Intel(R) Core(TM) i5-9300HF CPU @ 2.40GHz&lt;/li&gt;
&lt;li&gt;Thread(s) per core:  2&lt;/li&gt;
&lt;li&gt;Core(s) per socket:  4&lt;/li&gt;
&lt;li&gt;Socket(s):           1&lt;/li&gt;
&lt;li&gt;CPU max MHz:         4100.0000&lt;/li&gt;
&lt;li&gt;CPU min MHz:         800.0000&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GPU :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NVIDIA GeForce GTX 1650&lt;/li&gt;
&lt;li&gt;Turing Architecture&lt;/li&gt;
&lt;li&gt;Driver Version:      535.54.03&lt;/li&gt;
&lt;li&gt;CUDA Version:        12.2&lt;/li&gt;
&lt;li&gt;VRAM :               4GB&lt;/li&gt;
&lt;li&gt;Compute Capability : 7.5&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The input image was a 10k x 20k FITS file with 32-bit floating point values. The kernel was a 3x3 matrix with 32-bit floating point values.&lt;/p&gt;

&lt;h4 id="cpu-multi-threaded"&gt;CPU Multi-threaded&lt;/h4&gt;

&lt;p&gt;&lt;img alt="CPU" src="https://labeeb-7z.github.io/Blogs/img/posts/gpus/cpu-result.png"&gt;&lt;/p&gt;

&lt;h4 id="gpu"&gt;GPU&lt;/h4&gt;

&lt;p&gt;&lt;img alt="GPU" src="https://labeeb-7z.github.io/Blogs/img/posts/gpus/gpu-result.png"&gt;&lt;/p&gt;

&lt;p&gt;The overall speedup seems to be only 6X, but this also counts the time taken to transfer the data from the CPU to the GPU and back. If we only consider the time taken to perform the convolution, the speedup is around 700X!&lt;/p&gt;</description><category>gnuastro</category><guid>http://openastronomy.org/Universe_OA/posts/2023/07/20230704_0000_labeeb-7z/</guid><pubDate>Mon, 03 Jul 2023 23:00:00 GMT</pubDate></item></channel></rss>