Additional API Features

In this section we highlight the additional API features and optimisations that pyfive provides beyond the standard h5py functionality.

Modifications to the File API

When accessing a file, two modifications to the standard h5py API can be used to optimise performance: a new method (get_lazy_view) and an additional keyword argument on visititems (noindex). Both support access to all dataset metadata without loading chunk indices.

Note

Loading chunk indices at dataset instantiation is usually a useful optimisation, but it is wasted work if you do not intend to access the data itself.

The Group API is fully documented in the autogenerated API reference, but the additional methods and keyword arguments are highlighted here. These methods are also available on the File class, since File is a subclass of Group.

Group.get_lazy_view(y: object) → object

This instantiates the object y and, if it is a chunked dataset, does so without reading the b-tree index. This is useful for inspecting a variable whose data you do not expect to access. If you know you want to access the data, and in particular if you are going to hand the data to Dask or a similar library, you almost certainly want to read the index now, so use x[y] rather than x.get_lazy_view(y).

This is a pyfive extension to the standard h5py API.
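As a minimal sketch of how get_lazy_view might be used for inspection, the helper below gathers basic metadata for one variable. The helper name describe_variable is hypothetical, and it assumes the returned object exposes h5py-style shape and dtype attributes.

```python
def describe_variable(group, name):
    """Return basic metadata for one variable without reading its chunk index.

    `group` is assumed to be a pyfive File or Group, and `name` the key that
    would otherwise be used as group[name].
    """
    # get_lazy_view instantiates the object but skips the b-tree index read
    view = group.get_lazy_view(name)
    return {
        "name": name,
        "shape": view.shape,        # assumed h5py-style attribute
        "dtype": str(view.dtype),   # assumed h5py-style attribute
    }


# Hypothetical usage (requires pyfive and a real file; names are placeholders):
#   import pyfive
#   f = pyfive.File("example.h5")
#   print(describe_variable(f, "temperature"))
```

If the data are needed afterwards, indexing the group directly (f["temperature"]) is the route that reads the index, as described above.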

Group.visititems(func: Callable, noindex: bool = False) → object

Recursively visit all objects in this group and subgroups.

func should be a callable with the signature:

func(name, object) -> None or return value

Returning None continues iteration; returning anything else stops the traversal and returns that value from the visit method.

Passing the optional noindex=True ensures that all operations are lazy not only with respect to data, but also with respect to any chunked data indices. This keyword argument is a pyfive extension to the standard h5py API.
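The following sketch shows one way visititems with noindex=True might be used to catalogue every dataset in a file without touching chunk indices. The helper name list_datasets is hypothetical, and telling datasets from groups by the presence of a shape attribute is an assumption based on the h5py-style API.

```python
def list_datasets(group):
    """Map each dataset path under `group` to its shape, skipping chunk indices."""
    found = {}

    def collect(name, obj):
        # Assumption: datasets carry a shape attribute, groups do not
        if hasattr(obj, "shape"):
            found[name] = obj.shape
        # Returning None tells visititems to keep traversing
        return None

    # noindex=True is the pyfive extension described above
    group.visititems(collect, noindex=True)
    return found


# Hypothetical usage (requires pyfive and a real file):
#   import pyfive
#   f = pyfive.File("example.h5")
#   for path, shape in list_datasets(f).items():
#       print(path, shape)
```

To stop at the first match instead, the callback can return a non-None value, which visititems then returns to the caller.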

Modifications to the DatasetID API

When accessing datasets, additional functionality is exposed via the pyfive.h5d.DatasetID class, which implements the low-level data access methods for datasets (Variables).

The DatasetID API is fully documented in the autogenerated API reference, but additional methods and attributes are highlighted here:

DatasetID.first_chunk

The integer address of the first data chunk for this variable.

This property may be of use in understanding the read performance of chunked data in object stores. first_chunk is a pyfive API extension.

DatasetID.btree_range

A tuple containing the address of the first b-tree node for this variable and the address of the furthest-away node (which may not be the last one in the chunk index).

This property may be of use in understanding the read performance of chunked data in object stores. btree_range is a pyfive API extension.
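As a rough illustration, first_chunk and btree_range could be combined into a small layout summary for a chunked variable. The helper name index_layout is hypothetical, and reaching the DatasetID through an h5py-style .id attribute on the dataset is an assumption; the span is simply the distance between the two reported addresses.

```python
def index_layout(dataset):
    """Summarise where a chunked variable's index and first chunk sit in the file."""
    dsid = dataset.id  # assumption: h5py-style accessor returning the DatasetID
    first_node, furthest_node = dsid.btree_range
    return {
        "first_chunk": dsid.first_chunk,
        "btree_first": first_node,
        "btree_furthest": furthest_node,
        # Distance between the two node addresses; a large value suggests
        # the index is scattered widely through the file
        "btree_span": abs(furthest_node - first_node),
    }


# Hypothetical usage (requires pyfive and a chunked variable):
#   import pyfive
#   f = pyfive.File("example.h5")
#   print(index_layout(f["temperature"]))
```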

DatasetID.set_pseudo_chunk_size(newsize_MB)

Set the pseudo chunking size for contiguous variables. The default value is 4 MB, which should be suitable for most applications. For arrays smaller than this value, no pseudo chunking is used. Larger arrays will be accessed in reads of roughly newsize_MB each.

This is a pyfive API extension.
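The sketch below pairs a call to set_pseudo_chunk_size with a back-of-envelope estimate of how many reads a contiguous array might need. Both helper names are hypothetical. The estimate is not pyfive's actual partitioning logic: it assumes 1 MB means 2**20 bytes and that reads are of roughly the pseudo chunk size, and access through dataset.id is likewise an assumption.

```python
import math


def estimated_reads(shape, itemsize, size_mb=4):
    """Rough number of reads for a contiguous array (illustrative only)."""
    nbytes = math.prod(shape) * itemsize
    limit = size_mb * 1024 * 1024  # assumption: MB taken as 2**20 bytes
    if nbytes <= limit:
        return 1  # small arrays: no pseudo chunking, one read
    return math.ceil(nbytes / limit)


def use_larger_reads(dataset, size_mb=16):
    """Raise the pseudo chunk size on a contiguous variable and return it."""
    # set_pseudo_chunk_size is the pyfive extension described above
    dataset.id.set_pseudo_chunk_size(size_mb)
    return dataset


# Hypothetical usage (requires pyfive and a contiguous variable):
#   import pyfive
#   f = pyfive.File("example.h5")
#   var = use_larger_reads(f["temperature"], size_mb=16)
```

Under these assumptions, a 1000 x 1000 array of 8-byte values (8,000,000 bytes) would need about two reads at the default 4 MB setting.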