astropy@GSoC Blog Post #6, Week 8&9

Heads-up about the Progress of #11897

In summary the situation of the concerned PR a few days back was 4 types of CI test errors, one bug and possibly a need for modification of part of the code copied from pycdsreadme. All these have been taken care of as detailed below, but for the numpy depreciation warnings that keep coming up. I don't think we can do anything about the latter's persistence as of now. I shall comment more about it on GitHub as well.
  1. File not found error: Moritz's HW, i.e. using get_pkg_data_filename import, directly took care of this.
  2. Error in coord col decimal places: The precision of the coordinate component columns was getting set arbitrarily, which created difference in the output for 32-bit and 62-bit machines, and possibly between different operating systems. This has been corrected by having a fixed number of 12 digits after decimal for RAs, DEs and the latitude/longitude columns of Galactic and Ecliptic coords. This error also relates with the Formats bug.
  3. SphericalRepresentation col error: Now, this was a bit major issue compared to the two above, although the solution was only 2 line changes. When the coords cols were checked for and divided into components, the original SkyCoord col was deleted right within the loop. This made the iteration index of the loop to point to i+2 column after deletion, where i is the index of the original SkyCoord col. That is, effectively skipping the immediate next column after the SkyCoord col, as it would have receded by one place in the list. Got this fixed by popping the original SkyCoord col after all the columns in the table have been iterated over. This way all object type columns are converted to Column objects with str values.
  4. ~table.tests and test_write failures: All these errors were warnings due to depreciation of numpy specific aliases for different Python types. Most previous tests in Astropy appear to use these now depreciated numpy types, which raises warnings during testing our code. I have been able to provide remedy for majority of these by additionally using np.issubdtype(col.dtype, np.integer) while checking if the columns has integer values, however, tests with oldest supported version of all dependencies still fails. See my GitHub comment for more info.

The
formats bug

This was another major problem we had stumbled upon. It took me a while to skim through various docs and codes to find the optimum fix for this.

Our initial insight was that the difference between the Byte-By-Byte description and the data part of the written table, when the formats argument is passed to the write function, related in some manner to the string formatting part of the code. By first look itself, it was evident that there isn't any provision in the writer for cases when the columns already contain a format attribute, which is what is assigned when formats is passed, as I had written here back then. Creating allowance for this was easy enough, right away correcting the test outputs. Now, both the Byte-By-Byte and the table data had the number of decimal digits, or whatever other format for that matter, we wanted them to have. Apart from the internally created coordinate component columns, for which the number of digits after decimal was fixed.

It is when we want to go a step further than this and wanna truncate or eradicate the string formatting part to obtain the column format, that we stumble upon a road block. There are two concerns,
  • If no formats argument is passed, col.format will be set to None.
  • Even if we already know the column format, say .5f, we still need to evaluate the maximum size of the value strings of the column in most cases, and do some formatting to have the format in CDS/MRT recommendation, Fx.5.
The column formats passed in the formats argument are set by using the in-build Python function format (https://docs.python.org/3/library/functions.html#format). For cases when no formats argument is passed, the default behavior when writing the table data, for instance in the FixedWidth writer is to set the column format to '' which is equivalent to saying val = str(val). (https://docs.astropy.org/en/stable/table/construct_table.html#table-format-string) FixedWidth uses the maximum length of these strings to get the column widths. So, there the string formatting part of the code is essential if we want to know the correct format for columns without string values.

However, there may be another solution to this that can be tried in the long-term. I was curious to know what other writers in Astropy did in such situations when the column format needs to be given explicitly in the header of the written table. There aren't extravagantly many such use cases, but the FITS standard tables do have format keywords in the header as serve the purpose well. So, looking over the Astropy FITS writer, I found the way in which it deals with the problem of assigning column formats is by separately defining all the formats that can be and then using a custom Column class which has some default format attributes. (See:
https://github.com/astropy/astropy/blob/main/astropy/io/fits/column.py). ASCII writers also have a custom Column class, but the attributes that it currently has are exceedingly lacking to be of any use to us now. (https://github.com/astropy/astropy/blob/79323de928e87827526ed8fce04986a5dd459794/astropy/io/ascii/core.py#L270) In the long-run, we could take motivation from the FITS writer and make changes herein.

Other updates

I have began to work on the other two branches for Time cols and MRT metadata resp and would have them done in some time.
On an unrelated note, I found that the test_cds_header_from_readme.py test file in astropy.io.ascii.tests contains some CDS reading tests. It was recently modified by the 11593 PR (https://github.com/astropy/astropy/pull/11593/files). I imagine that these tests can be incorporated within test_cds.py and then we won't perhaps have to move CDS/MRT tests to any other test file?