Usage

For the most part, you use pyfive exactly as you would use h5py, so the h5py documentation is also relevant. However, pyfive has some additional API features and optimisations, which are described in the section on “Additional API Features”.

Note

The pyfive API does not implement the entire h5py API: in particular, there is no support for writing files, and most of the low-level h5py API is not implemented. That said, if you find a case where high-level h5py read functionality is not supported by pyfive, please report it on our GitHub Issues page.

Working with Files

pyfive provides a high-level interface to HDF5 files, similar to h5py. You can open an HDF5 file and access its contents using dictionary-like syntax, and datasets are loaded lazily on both POSIX and S3 filesystems.

Here is a simple example of how to open an HDF5 file and read its contents using pyfive:

import pyfive

# Open the file in read-only mode
with pyfive.File("data.h5", "r") as f:
    # List the top-level groups and datasets
    print("Keys:", list(f.keys()))

    # Access a group
    grp = f["/my_group"]

    # List items inside that group
    print("Items in /my_group:", list(grp.keys()))

    # Access a dataset and inspect its shape and dtype
    dset = grp["my_dataset"]
    print("Shape:", dset.shape)
    print("Data type:", dset.dtype)

    # Read an entire dataset into a NumPy array
    data = dset[...]
    print("First element:", data[0])

In this example:

pyfive.File opens the file, returning a file object that behaves like a Python dictionary.
Groups (Group objects) can be accessed using dictionary-like keys.
Datasets (Dataset objects) expose attributes such as shape and dtype, which are loaded when you list them, but the data itself is not read from storage into NumPy arrays until you access it. (Lazy loading is discussed in more detail in the section on “Optimising Access Speed”.)
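Because File and Group objects behave like dictionaries, a hierarchy can be traversed with ordinary mapping code. The following is a minimal sketch, not part of pyfive: walk is a hypothetical helper, and it assumes that groups provide an items() method in the way h5py groups do. Nested dicts stand in for an open file so that the snippet runs without any HDF5 file.

```python
def walk(node, prefix=""):
    """Yield (path, item) for everything below a dict-like node, depth first."""
    for name, item in node.items():
        path = f"{prefix}/{name}"
        yield path, item
        # Assumption: anything with .items() is a group, so descend into it
        if hasattr(item, "items"):
            yield from walk(item, path)


# Nested dicts stand in for a file containing one group and two datasets
fake_file = {"my_group": {"my_dataset": [1, 2, 3]}, "other_dataset": [4, 5]}
paths = [path for path, _ in walk(fake_file)]
print(paths)  # ['/my_group', '/my_group/my_dataset', '/other_dataset']
```

With a real file, the same generator would be called as walk(f) inside the with block, under the assumption stated above.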

Note

If you are used to working with NetCDF4 files (and perhaps netcdf4-python), note that a File in pyfive corresponds to a NetCDF4 Dataset (both are read from an actual file), and that the HDF5/pyfive/h5py concept of a Dataset corresponds to a NetCDF Variable. The notion of a group is semantically similar in both cases.

Working with Datasets

Most of the time, you will access datasets in a similar way to how you would with h5py. You can read data from a dataset using slicing, and you can also access attributes associated with the dataset. Here is an example:

import pyfive

with pyfive.File("data.h5", "r") as f:
    # Access a dataset
    dset = f["/my_group/my_dataset"]

    # Read a slice of the dataset
    data_slice = dset[10:20]
    print("Data slice:", data_slice)

    # Access attributes of the dataset
    print("Attributes:", dset.attrs)

One notable feature of pyfive is that the variable dset created above remains usable outside the context manager (i.e. after the with block). You can therefore close the file and still work with the dataset, as long as you instantiated it before the file was closed. This is particularly useful when working with large datasets in a parallel environment, where you might want to close the file to free up resources while still needing access to some of the data.

Note

This functionality relies on the attributes and chunk index of the dataset being read when you first access it, which is why you can continue to use the dataset after closing the file. Access is fully lazy (no data is read until needed) and thread-safe, and we have tests to ensure that this behaviour works correctly in multi-threaded scenarios.
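As an illustration of the multi-threaded use described in this note, the sketch below reads several slices concurrently from one sliceable object. It is a hedged example rather than pyfive code: read_slices is a hypothetical helper, the worker count is arbitrary, and a plain list stands in for a dataset that was obtained before its file was closed.

```python
from concurrent.futures import ThreadPoolExecutor


def read_slices(dset, slices, max_workers=4):
    """Read each slice of dset in a worker thread; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda s: dset[s], slices))


# A list stands in for a dataset obtained before the file was closed
stand_in = list(range(100))
chunks = read_slices(stand_in, [slice(0, 3), slice(10, 12)])
print(chunks)  # [[0, 1, 2], [10, 11]]
```

The helper only relies on indexing with dset[s], so in principle a dataset such as dset from the example above could be passed in place of the list.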

Using S3/Object Storage

pyfive is designed to work seamlessly with both local filesystems and S3-compatible object storage (and probably any remote storage that supports the fsspec API). However, there are some additional considerations when working with S3, the most important of which is the need to use the s3fs library to provide a filesystem interface to S3.

Here is a simple example of how to open an HDF5 file stored in S3 and read its contents using pyfive:

import pyfive
import s3fs


S3_URL = "https://my-s3-place.ac.uk"
filename = "filename.nc"
blocks_MB = 1  # Block size for S3 access, in MiB
s3params = {
    "endpoint_url": S3_URL,
    "default_fill_cache": False,
    "default_cache_type": "readahead",
    "default_block_size": blocks_MB * 2**20,  # converted to bytes
}
fs = s3fs.S3FileSystem(anon=True, **s3params)

# Now we can open the file using the S3 filesystem
uri = "s3-bucket/" + filename
with fs.open(uri, "rb") as s3file:
    with pyfive.File(s3file, "r") as f:
        dset = f["variable"]
    # The pyfive file is closed here, but the S3 file object is still open
    data = dset[10:20]
    print("Data slice from S3:", data)

Note

The ideal s3fs parameters to use (s3params) will depend on what you are actually doing with the file, as discussed in the section on “Optimising Access Speed”. The parameters shown above work well for accessing small amounts of data from a large file, but you may need to adjust them for your specific use case.

The S3 example above also shows that, although you can leave the pyfive.File context manager and still access its datasets, you need to ensure that the S3 filesystem is still available when you do so.
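Since the s3params dictionary is likely to be adjusted per use case, it can help to build it in one place. The sketch below is one possible way to do that: make_s3params is a hypothetical helper, the dictionary keys are the ones used in the S3 example above, and the two block sizes are illustrative values, not tuned recommendations.

```python
def make_s3params(endpoint_url, block_mb=1, cache_type="readahead"):
    """Build keyword arguments for s3fs.S3FileSystem (hypothetical helper).

    block_mb is the read block size in MiB; it is converted to bytes.
    """
    if block_mb <= 0:
        raise ValueError("block_mb must be positive")
    return {
        "endpoint_url": endpoint_url,
        "default_fill_cache": False,
        "default_cache_type": cache_type,
        "default_block_size": int(block_mb * 2**20),
    }


# Two configurations that differ only in block size (illustrative values)
small_blocks = make_s3params("https://my-s3-place.ac.uk", block_mb=1)
large_blocks = make_s3params("https://my-s3-place.ac.uk", block_mb=16)
print(small_blocks["default_block_size"])  # 1048576
```

Either dictionary can then be passed on in the same way as in the example above, i.e. s3fs.S3FileSystem(anon=True, **small_blocks).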