Introduction

About Pyfive

pyfive provides a pure Python HDF5 reader designed to be a thread-safe, drop-in replacement for h5py, with no dependency on the HDF5 C library. It aims to support the same API as h5py for reading files. Any case where reading a file needs a feature that the high-level h5py interface supports but pyfive does not is considered a bug and should be reported in our Issues.

Writing HDF5 output is not a goal of pyfive and portions of the h5py API which apply only to writing will not be implemented.
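As a minimal sketch of what "drop-in" means for read-only code, the helper below is written against the shared File/dataset interface only, so the File class can be passed in. The helper name summarise_dataset, the file name and the dataset name are hypothetical, and the sketch assumes the File class opens files read-only when no mode is given.

```python
def summarise_dataset(file_cls, path, name):
    """Open `path` read-only and return basic facts about dataset `name`.

    `file_cls` is the File class to use, for example pyfive.File or
    h5py.File. This helper is illustrative and not part of either library.
    """
    # Assumption: calling the File class with only a path opens it read-only.
    with file_cls(path) as f:
        ds = f[name]
        return {
            "shape": tuple(ds.shape),
            "dtype": str(ds.dtype),
            "attrs": dict(ds.attrs),
        }


# Intended usage (not executed here; "data.h5" and "temperature" are
# hypothetical names):
#
#   import pyfive
#   print(summarise_dataset(pyfive.File, "data.h5", "temperature"))
```

Passing the File class as an argument keeps the reading code identical whichever library is installed, which is the property the paragraph above describes.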

Note

While pyfive is designed to be a drop-in replacement for h5py, the reverse may not be possible. Some actions that can be performed with pyfive are not supported by h5py, because pyfive extends the h5py API beyond its initial specifications. This documentation makes clear which parts of the API are extensions and where behaviour differs by design from h5py.

pyfive was developed for many reasons, but recent work has prioritised thread-safety, lazy loading, and performance at scale in cloud environments, both standalone and as a backend for other software such as cf-python, xarray, and h5netcdf.

As well as the high-level h5py API, we have implemented a version of the h5d.DatasetID class, which now holds all the code used for data access (as opposed to attribute access). We have also implemented extra methods (beyond the h5py API) that expose the chunk index directly (as well as via an iterator) and that access chunk information using the zarr indexing scheme rather than the h5py indexing scheme. This avoids the need to run kerchunk beforehand to build a zarr index for a file.
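To illustrate the difference between the two indexing schemes, the sketch below converts between an h5py-style chunk offset (the element index of a chunk's first element) and a zarr-style chunk grid coordinate (the chunk's position counted in whole chunks). The function names are hypothetical and are not pyfive methods; the key format assumes zarr v2 with its default "." separator.

```python
def chunk_offset_to_grid_coord(chunk_offset, chunk_shape):
    """Convert an element offset such as (0, 200) to a grid coordinate."""
    if len(chunk_offset) != len(chunk_shape):
        raise ValueError("offset and chunk shape must have the same rank")
    coord = []
    for off, size in zip(chunk_offset, chunk_shape):
        if off % size != 0:
            raise ValueError("offset does not lie on a chunk boundary")
        coord.append(off // size)
    return tuple(coord)


def grid_coord_to_chunk_offset(coord, chunk_shape):
    """Inverse conversion: grid coordinate back to element offset."""
    return tuple(c * s for c, s in zip(coord, chunk_shape))


def zarr_v2_key(coord):
    """Format a grid coordinate as a zarr v2 chunk key, e.g. '0.2'."""
    return ".".join(str(c) for c in coord)


coord = chunk_offset_to_grid_coord((0, 200), (50, 100))
print(coord, zarr_v2_key(coord))
```

With chunks of shape (50, 100), the chunk starting at element (0, 200) is the third chunk along the second axis, so its grid coordinate is (0, 2).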

The code also includes an implementation of what we have called pseudochunking, which is used for accessing a contiguous array that is larger than memory via S3. In essence, all this does is declare default chunks aligned with the array order on disk and use them for data access.
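The idea can be sketched as follows, assuming a C-ordered (row-major) contiguous array: blocks of whole leading-axis rows are each one contiguous byte range, so they can serve as chunks that map to single range requests. This is an illustration of the principle only, not pyfive's actual algorithm; the function name and the max_chunk_bytes parameter are hypothetical.

```python
import math


def pseudo_chunks(shape, itemsize, max_chunk_bytes):
    """Yield (row_start, row_stop, byte_offset, nbytes) for each pseudo-chunk.

    Byte offsets are relative to the start of the array's data on disk.
    Each pseudo-chunk holds a whole number of leading-axis rows, so it is
    contiguous in a C-ordered array.
    """
    if not shape or any(n <= 0 for n in shape):
        raise ValueError("shape must be non-empty with positive extents")
    # Bytes occupied by one index along the leading axis.
    row_bytes = itemsize * math.prod(shape[1:])
    # At least one row per chunk, even if a single row exceeds the budget.
    rows_per_chunk = max(1, max_chunk_bytes // row_bytes)
    for start in range(0, shape[0], rows_per_chunk):
        stop = min(start + rows_per_chunk, shape[0])
        yield start, stop, start * row_bytes, (stop - start) * row_bytes


for chunk in pseudo_chunks(shape=(10, 4), itemsize=8, max_chunk_bytes=100):
    print(chunk)
```

Here each row takes 32 bytes, so a 100-byte budget gives three rows per pseudo-chunk, with a shorter final chunk covering the last row.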

There are optimisations to support cloud usage. The most important is that once a variable is instantiated (i.e. for an open pyfive.File instance f, when you do v = f['variable_name']), the attributes and B-tree (chunk index) are read; it is then possible to close the parent file (f) and continue to use v.
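The pattern described above might be used as in the following sketch: the variable is obtained while the file is open, the file is closed, and slices are then read from several threads. The helper names are hypothetical, the File class is passed in as an argument, and the behaviour after closing relies on the pyfive property stated in the text (the same code is not expected to work with h5py).

```python
from concurrent.futures import ThreadPoolExecutor


def open_variable(file_cls, path, name):
    """Return the variable `name`, closing its parent file before returning."""
    with file_cls(path) as f:
        # Per the text, attributes and the chunk index are read at this point.
        v = f[name]
    return v  # the parent file is now closed


def read_in_threads(v, selections, max_workers=4):
    """Read each selection from `v` in a worker thread, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda sel: v[sel], selections))


# Intended usage (not executed here; file and variable names are hypothetical):
#
#   import pyfive
#   v = open_variable(pyfive.File, "data.h5", "variable_name")
#   parts = read_in_threads(v, [slice(0, 100), slice(100, 200)])
```

Keeping only v alive, rather than the whole file object, means worker threads do not share an open file handle through the parent File instance.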

The package also includes a command line tool (p5dump) which can be used to dump the contents of an HDF5 file to the terminal.

Note

We have test coverage showing that using a variable (v) after its parent file has been closed, as described above, is thread-safe. The test that demonstrates this is slow, but it needs to be, since shorter tests did not always exercise the expected failure modes.

The pyfive test suite includes all the components necessary for testing pyfive accessing data via both POSIX and S3.