Adding a New Format

This page details how to add IO support for a new format to DASCore. The steps are:

  1. Create a new module, subclass FiberIO, and implement the appropriate methods.

  2. Find a small test file to include in DASCore’s test suite.

  3. Register the new FiberIO subclass(es) for generic tests.

  4. (Optional) Write format specific tests for the new format.

  5. Register the new FiberIO subclasses with DASCore’s plugins.

To demonstrate this process, imagine adding support for a format called jingle which conventionally uses a file extension of jgl.

Adding a New IO Module

First, we create a new io module called ‘jingle’ in DASCore’s io module (dascore/io/jingle). Make sure there is a __init__.py file in this module whose docstring describes basic use of the format and lists any non-obvious implementation details that might debug/improve the parser.

contents of dascore/io/jingle/__init__.py

"""
Jingle format support module.

Jingle is a really cool new DAS format.

It supports all the "bells" and whistles.

Examples
--------
import dascore as dc

jingle = dc.spool('path_to_file.jgl')
"""
'\nJingle format support module.\n\nJingle is a really cool new DAS format.\n\nIt supports all the "bells" and whistles.\n\nExamples\n--------\nimport dascore as dc\n\njingle = dc.spool(\'path_to_file.jgl\')\n'

Next, create a core.py file in the new module (dascore/io/jingle/core.py). Start by creating a class called JingleIOV1 which subclasses (dascore.io.core.FiberIO)[dascore.io.core.FiberIO]. Now, on your subclass, you need to implement the supported methods.

Contents of dascore/io/jingle/core.py:

"""
Core module for jingle file format support.
"""
import dascore.exceptions
from dascore.io.core import FiberIO


class JingleV1(FiberIO):
    """
    An IO class supporting version 1 of the jingle format.
    """
    # you must specify the format name using the name attribute
    name = 'jingle'
    # you can also define which file extensions are expected like so.
    # this will speed up DASCore's automatic file format determination.
    preferred_extensions = ('jgl',)
    # also specify a version so when version 2 is released you can
    # just make another class in the same module named JingleV2.
    version = '1'

    def read(self, path, jingle_param=1, **kwargs):
        """
        Read should take a path and return a patch or sequence of patches.

        It can also define its own optional parameters, and should always
        accept kwargs. If the format supports partial reads, these should
        be implemented as well.
        """

    def get_format(self, path, **kwargs):
        """
        Used to determine if path is a supported jingle file.

        Returns a tuple of (format_name, file_version) if the file is a
        supported jingle file, else return False or raise a
        dascore.exceptions.UnknownFiberFormat exception.
        """

    def scan(self, path, **kwargs):
        """
        Used to get metadata about a file without reading the whole file.

        This should return a list of
        [`PatchAttrs`](`dascore.core.attrs.PatchAttrs`) objects
        from the [dascore.core.attrs](`dascore.core.attrs`) module, or a
        format-specific subclass.
        """

    def write(self, patch, path, **kwargs):
        """
        Write a patch or spool back to disk in the jingle format.
        """

All 4 methods are optional; some formats will only support reading, others only writing. If is_format is not implemented the format will not be auto-detectable, meaning you will have to manually pass the format to read and spool.

Note

Note that each of these methods should have **kwargs at the end. This allows certain formats to have special keywords while others are not required.

Warning

It is very important that the scan method returns exactly the same patch information as reading the patch would, otherwise the lazy merge planning done by spool can be wrong!

Support for Streams/Buffers

Rather than using paths for the IO methods as shown above, it is better practice to write a FiberIO which supports the python stream interface or an opened HDF5 file in the form of a pytables.File or h5py.File object. There are a few reasons for this:

  • More types of inputs can be supported, including steaming file contents from the web or in-memory streams like BytesIO.
  • It is usually more efficient since open-file handles can be automatically reused.

To make this easy, DASCore will automatically manage and serve the right input to FiberIO methods based on type hints. Here are the ones currently supported, all of which are imported from dascore.io:

  1. BinaryReader - A stream-like object which must have a read and seek method.

  2. BinaryWriter - A stream-like object which must have a write method.

  3. H5Reader - An instance of h5py.File which is open in read mode.

  4. H5Writer - An instance of h5py.File which is open in append mode.

  5. PyTablesReader - An instance of pytables.File which is open in read mode.

  6. PyTablesWriter - An instance of pytables.File which is open in append mode.

Deciding which to use depends on whether the file is an HDF5-based or binary format, and which hdf5 library you want to use.

Note

H5Reader should be preferred to PytablesReader, especially for the get_format method.

Note

If a type hint other than the ones listed above is given to the relevant parameter (path, or resource in these examples) it will have no effect.

Assuming Jingle is a binary file format, here is an implementation which supports binary streams (only showing the read method for brevity):

"""
Core module for jingle file format support.
"""
import io

import dascore.exceptions
from dascore.io import FiberIO, BinaryReader, BinaryWriter


class JingleV1(FiberIO):
    """
    An IO class supporting version 1 of the jingle format.
    """
    name = 'jingle'
    preferred_extensions = ('jgl',)
    version = '1'

    def read(self, resource: BinaryReader, jingle_param=1, **kwargs):
        """
        get_format now accepts a stream, which DASCore will ensure is provided.
        """
        # raise an error if we get the wrong type
        assert isinstance(resource, io.BufferedReader)
        # read first 50 bytes (maybe they have header info)
        first_50_bytes = resource.read(50)
        # seek back to byte 20
        resource.seek(20)
        # etc.

Now, whether we call dascore.read or JingleV1.read a readable binary stream will be provided to our implementation. For example:

from pathlib import Path

import numpy as np

import dascore as dc

path = Path("test_numpy_binary_file.npy")
# make a binary file to read
array = np.random.random(100)
np.save(path, array)

out = dc.read(path, file_format="jingle", file_version="1")
jingle_io = JingleV1()

out = jingle_io.read(path)

path.unlink()  # cleanup test file

Writing Tests

Next we need to write tests for the format (you weren’t thinking of skipping this step were you!?). The hardest part of testing new file formats is finding a small (typically no more than 10 ish mb) file to include in the test suite. Once you have such a file, Adding test data details how to add it to the registry.

Once the test file is added to the data registry, you can register the new format so a suite of tests run automatically. This is done by adding the format to the appropriate data structures in tests/test_io/test_common_io.py. The comments at the top of the file will guide you through this process.

For some formats, the generic tests will be sufficient. For others, additional test cases may be required. These are placed in test_io folder. In our example, we would create the folder dascore/tests/test_io/test_jingle. Assuming you added a file called “jingle_test_file.jgl” to DASCore’s data registry, then we could create the test file dascore/tests/test_io/test_jingle/test_jingle.py and its contents might look something like this:

import pytest

import dascore
from dascore.utils.downloader import fetch
from dascore.io.jingle import JingleV1

@pytest.fixture(scope='class')
def jingle_file_path():
    """Return the path to the test jingle file."""
    # fetch will ensure the data file is downloaded and cached.
    path = fetch("jingle_test_file.jgl")
    return path


class TestJingleIO:
    """Tests specific to the jingle IO format."""

    def test_issue_xx(self, jingle_file_path):
        """Test to capture a specific reported issue with this format."""
        ...

    def test_read_option(self, jingle_file_path):
        """Tests for a jingle-specific read option"""
        jingle = JingleV1()
        patch = jingle.read(jingle_file_path, special_option=2)
        ...

Register Plugin

Now that the Jingle format support is implemented and tested, the final step is to register the jingle FiberIO subclasses in DASCore’s entry points. This is done under the [project.entry-points.”dascore.fiber_io”] section in DASCore’s pyproject.toml file. For example, after adding jingle, the pyproject.toml section might look like this:

[project.entry-points."dascore.fiber_io"]
TERRA15__V4 = "dascore.io.terra15.core:Terra15FormatterV4"
PICKLE = "dascore.io.pickle.core:PickleIO"
WAV = "dascore.io.wav.core:WavIO"
DASDAE__V1 = "dascore.io.dasdae.core:DASDAEV1"
JINGLE__V1 = "dascore.io.jingle.core:JingleV1"
Note

The name and version of the format are separated by a double underscore.

Directories as Inputs

Some FiberIO formats may not be self-contained files, but rather must be understood in the context of an entire directory. In these cases, the input_type parameter on the FiberIO subclass should be set to “directory”. See the xml_binary module for an example of a directory based FiberIO implementation.

Warning

DASCore assumes a directory-based FiberIO does not have any sub patch files of a different format. Once a valid FiberIO directory is found, contents of the directory are no longer searched for Patch files.