API Reference

File

class pyfive.File(filename: str | BinaryIO | MetadataBufferingWrapper, mode: str = 'r', metadata_buffer_size: int = 1)[source]

Bases: Group

Open an HDF5 file.

Note that, in addition to its file-specific methods, the File object also inherits the full interface of Group.

File is also a context manager and therefore supports the with statement. Files opened by the class are closed when the with block exits; file-like objects passed in by the caller are not closed.

Parameters:
  • filename (str or file-like) – Name of the file (a string), or a file-like object with read and seek methods that behave like those of a Python file object.

  • mode (str) – File open mode (default: “r”, read-only).

  • metadata_buffer_size (int) – Size of the metadata buffer for S3/remote files in MiB (default: 1 MiB). Larger values reduce network calls but use more memory. (This is a pyfive extension for optimizing remote file access; it is ignored for local files.)
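
As a minimal sketch of the context-manager usage described above, the helper below opens a file read-only and lists the members of its root group. The helper name list_root_members and its buffer_mib parameter are illustrative and not part of pyfive; only the constructor arguments shown in the signature above are used.

```python
def list_root_members(path, buffer_mib=1):
    """Return the names of the objects in the root group of an HDF5 file.

    `path` may be a file name or a file-like object, as accepted by
    pyfive.File. `buffer_mib` is passed through as metadata_buffer_size,
    which is documented as relevant only for S3/remote files.
    """
    # Deferred import so the helper can be defined even where pyfive
    # is not installed; it is only needed when the helper is called.
    import pyfive

    # The with block closes files that pyfive opened itself. A file-like
    # object supplied by the caller is left open.
    with pyfive.File(path, mode="r", metadata_buffer_size=buffer_mib) as f:
        return list(f.keys())
```

A call such as list_root_members("example.h5") would return the root-level names, where example.h5 stands for any HDF5 file you have to hand.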

filename

Name of the file on disk, None if not available.

Type:

str

mode

String indicating that the file is open read-only (“r”).

Type:

str

userblock_size

Size of the user block in bytes (currently always 0).

Type:

int

Attributes:

attrs

attrs attribute.

consolidated_metadata

Returns True if all B-tree nodes for chunked datasets are located before the first chunk in the file.

Methods:

close()

Close the file.

get(k[,d])

get_lazy_view(y)

This instantiates the object y, and if it is a chunked dataset, does so without reading the b-tree index.

items()

keys()

values()

visit(func)

Recursively visit all names in the group and subgroups.

visititems(func[, noindex])

Recursively visit all objects in this group and subgroups.

property attrs

attrs attribute.

close()[source]

Close the file.

property consolidated_metadata: bool

Returns True if all B-tree nodes for chunked datasets are located before the first chunk in the file.

get(k[, d]) → D[k] if k in D, else d. d defaults to None.

get_lazy_view(y: object) → object

This instantiates the object y, and if it is a chunked dataset, does so without reading the b-tree index. This is useful for inspecting a variable that you are not expecting to access. If you know you want to access the data, and in particular, if you are going to hand the data to Dask or something else, you almost certainly want to read the index now, so just do x[y] rather than x.get_lazy_view(y).

This is a pyfive extension to the standard h5py API.
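
A short sketch of the inspection use case described above: the helper gathers basic metadata for one variable through get_lazy_view. The helper name summarise_variable is illustrative. It assumes that the named member is a dataset (so that name, shape, dtype and chunks are available) and that reading those properties does not itself trigger an index read, which this reference does not state.

```python
def summarise_variable(group, name):
    """Return basic metadata for one dataset in `group`."""
    # get_lazy_view is the pyfive extension described above: for a
    # chunked dataset it avoids reading the b-tree index.
    view = group.get_lazy_view(name)
    return {
        "name": view.name,
        "shape": view.shape,
        "dtype": str(view.dtype),
        "chunks": view.chunks,
    }
```

If the data itself is going to be read, group[name] remains the better choice, as the text above explains.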

items() → a set-like object providing a view on D's items

keys() → a set-like object providing a view on D's keys

values() → an object providing a view on D's values

visit(func: Callable) → object

Recursively visit all names in the group and subgroups.

func should be a callable with the signature:

func(name) -> None or return value

Returning None continues the iteration; returning anything else stops it and returns that value from the visit method.
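
The callback contract above can be sketched as follows. The helper name find_first is illustrative; it relies only on the documented behaviour that a non-None return value stops the traversal and is handed back by visit.

```python
def find_first(group, suffix):
    """Return the first visited name ending with `suffix`, or None."""

    def visitor(name):
        # Returning None tells visit() to keep going; any other value
        # stops the traversal and becomes visit()'s return value.
        return name if name.endswith(suffix) else None

    return group.visit(visitor)
```

Because the traversal stops at the first match, members after it are never passed to the callback.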

visititems(func: Callable, noindex: bool = False) → object

Recursively visit all objects in this group and subgroups.

func should be a callable with the signature:

func(name, object) -> None or return value

Returning None continues the iteration; returning anything else stops it and returns that value from the visit method.

Passing the optional noindex=True ensures that all operations are lazy not only with respect to data, but also with respect to any chunked data indices. This keyword argument is a pyfive extension to the standard h5py API.

Group

class pyfive.Group(name: str, dataobjects: DataObjects, parent: Group)[source]

Bases: Mapping

An HDF5 Group which may hold attributes, datasets, or other groups.

attrs

Attributes for this group.

Type:

dict

name

Full path to this group.

Type:

str

file

File instance where this group resides.

Type:

File

parent

Group instance containing this group.

Type:

Group

Attributes:

attrs

attrs attribute.

Methods:

get(k[,d])

get_lazy_view(y)

This instantiates the object y, and if it is a chunked dataset, does so without reading the b-tree index.

items()

keys()

values()

visit(func)

Recursively visit all names in the group and subgroups.

visititems(func[, noindex])

Recursively visit all objects in this group and subgroups.

property attrs

attrs attribute.

get(k[, d]) → D[k] if k in D, else d. d defaults to None.

get_lazy_view(y: object) → object[source]

This instantiates the object y, and if it is a chunked dataset, does so without reading the b-tree index. This is useful for inspecting a variable that you are not expecting to access. If you know you want to access the data, and in particular, if you are going to hand the data to Dask or something else, you almost certainly want to read the index now, so just do x[y] rather than x.get_lazy_view(y).

This is a pyfive extension to the standard h5py API.

items() → a set-like object providing a view on D's items

keys() → a set-like object providing a view on D's keys

values() → an object providing a view on D's values

visit(func: Callable) → object[source]

Recursively visit all names in the group and subgroups.

func should be a callable with the signature:

func(name) -> None or return value

Returning None continues the iteration; returning anything else stops it and returns that value from the visit method.

visititems(func: Callable, noindex: bool = False) → object[source]

Recursively visit all objects in this group and subgroups.

func should be a callable with the signature:

func(name, object) -> None or return value

Returning None continues the iteration; returning anything else stops it and returns that value from the visit method.

Passing the optional noindex=True ensures that all operations are lazy not only with respect to data, but also with respect to any chunked data indices. This keyword argument is a pyfive extension to the standard h5py API.
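
As an illustration of visititems with noindex=True, the sketch below builds a map from dataset names to shapes without ever stopping early. The helper name collect_shapes is illustrative, and the test for "is this a dataset" (presence of a shape attribute) is an assumption made for this example, not a documented pyfive idiom.

```python
def collect_shapes(group):
    """Map the name of every dataset below `group` to its shape."""
    shapes = {}

    def visitor(name, obj):
        # Assumption: datasets expose `shape` and groups do not.
        shape = getattr(obj, "shape", None)
        if shape is not None:
            shapes[name] = shape
        return None  # never stop early, so every object is visited

    # noindex=True is the pyfive extension described above.
    group.visititems(visitor, noindex=True)
    return shapes
```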

Dataset

class pyfive.Dataset(name: str, datasetid: DatasetID, parent: Group)[source]

Bases: ABC

An HDF5 Dataset containing an n-dimensional array and metadata attributes.

shape

Dataset dimensions.

Type:

tuple

dtype

Dataset’s type.

Type:

dtype

size

Total number of elements in the dataset.

Type:

int

chunks

Chunk shape, or None if chunked storage is not used.

Type:

tuple or None

compression

Compression filter used on dataset. None if compression is not enabled for this dataset.

Type:

str or None

compression_opts

Options for the compression filter.

Type:

dict or None

scaleoffset

Setting for the HDF5 scale-offset filter, or None if scale-offset compression is not used for this dataset.

Type:

dict or None

shuffle

Whether the shuffle filter is applied for this dataset.

Type:

bool

fletcher32

Whether Fletcher32 checksumming is enabled for this dataset.

Type:

bool

fillvalue

Value indicating uninitialized portions of the dataset. None if no fill value has been defined.

Type:

float or None

ndim

Number of dimensions.

Type:

int

dims

Dimension scales.

Type:

None

attrs

Attributes for this dataset.

Type:

dict

name

Full path to this dataset.

Type:

str

file

File instance where this dataset resides.

Type:

File

parent

Group instance containing this dataset.

Type:

Group

Methods:

astype(dtype)

Return a context manager which returns data as a particular type.

iter_chunks([sel])

len()

Return the size of the first axis.

read_direct(array[, source_sel, dest_sel])

Read from an HDF5 dataset directly into a NumPy array.

Attributes:

attrs

attrs attribute.

chunks

chunks attribute.

compression

compression attribute.

compression_opts

compression_opts attribute.

dims

dims attribute.

dtype

dtype attribute.

fillvalue

fillvalue attribute.

fletcher32

fletcher32 attribute.

id

This is the DatasetID instance which provides the actual data access methods.

maxshape

maxshape attribute.

ndim

Number of dimensions.

scaleoffset

scaleoffset attribute.

shape

shape attribute.

shuffle

shuffle attribute.

size

size attribute.

value

Alias for dataset[()].

astype(dtype: str) → AstypeContext[source]

Return a context manager which returns data as a particular type.

Conversion is handled by NumPy after the data has been read.
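
A minimal sketch of how the context manager might be used, assuming the h5py-style pattern in which reads made inside the with block come back in the requested type. The helper name read_as is illustrative.

```python
def read_as(dset, dtype):
    """Read the whole dataset, converted to `dtype` (for example "float32")."""
    # Inside the with block, reads from `dset` are returned as `dtype`.
    # dset[()] selects the entire dataset, like the `value` alias.
    with dset.astype(dtype):
        return dset[()]
```

The context manager is entered without an "as" target because the reads still go through the dataset itself.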

property attrs

attrs attribute.

property chunks

chunks attribute.

property compression

compression attribute.

property compression_opts

compression_opts attribute.

property dims

dims attribute.

property dtype

dtype attribute.

property fillvalue

fillvalue attribute.

property fletcher32

fletcher32 attribute.

id

This is the DatasetID instance which provides the actual data access methods.

iter_chunks(sel=())[source]

len()[source]

Return the size of the first axis.

property maxshape

maxshape attribute (None for unlimited dimensions).

property ndim

Number of dimensions.

read_direct(array: ndarray, source_sel: None | tuple = None, dest_sel: None | tuple = None) → None[source]

Read from an HDF5 dataset directly into a NumPy array.

This is equivalent to array[dest_sel] = dset[source_sel].

Creation of intermediates is not avoided. This method is provided for compatibility with h5py; it is not efficient.

property scaleoffset

scaleoffset attribute.

property shape

shape attribute.

property shuffle

shuffle attribute.

property size

size attribute.

property value

Alias for dataset[()].

DatasetID

class pyfive.h5d.DatasetID(dataobject: DataObjects, noindex: bool = False, pseudo_chunking_size_MB: int = 4)[source]

Bases: ChunkRead

Implements an “HDF5 dataset identifier”, which, despite the name, actually represents the data of a dataset in a file rather than an identifier. It includes all the low-level methods for working with chunked data, lazily or not.

This class has been deliberately implemented so as to cache all the relevant metadata, so that once you have an instance it is completely independent of the parent file and can be used efficiently in distributed threads without thread contention on the b-tree etc. This behaviour may differ from h5py, which cannot isolate dataset access from parent file access because both share underlying C structures.

Attributes:

btree_range

A tuple with the address of the first b-tree node for this variable and the address of the furthest-away node (which may not be the last one in the chunk index).

dtype

Return numpy dtype of the dataset.

first_chunk

The integer address of the first data chunk for this variable.

index

Direct access to the chunk index, if there is one.

Methods:

get_chunk_info(index)

Retrieve storage information about a chunk specified by its index.

get_chunk_info_by_coord(coordinate_index)

Retrieve information about a chunk specified by the array address of the chunk’s first element in each dimension.

get_chunk_info_from_chunk_coord(chunk_coords)

Retrieve storage information about a chunk specified by its index.

get_data(args, fillvalue)

Called by the dataset getitem method.

get_num_chunks()

Return the total number of chunks in the dataset.

get_type()

Return pyfive type of the dataset.

iter_chunks(args)

Iterate over chunks in a chunked dataset.

read_direct_chunk(chunk_position, **kwargs)

Returns a tuple containing the filter_mask and the raw data storing this chunk as bytes.

set_parallelism([thread_count, ...])

Configure chunk-read parallelism.

set_pseudo_chunk_size(newsize_MB)

Set pseudo chunking size for contiguous variables.

property btree_range

A tuple with the address of the first b-tree node for this variable and the address of the furthest-away node (which may not be the last one in the chunk index). This property may be of use in understanding the read performance of chunked data in object stores. btree_range is a pyfive API extension.

property dtype

Return numpy dtype of the dataset.

property first_chunk

The integer address of the first data chunk for this variable.

This property may be of use in understanding the read performance of chunked data in object stores. first_chunk is a pyfive API extension.

get_chunk_info(index)[source]

Retrieve storage information about a chunk specified by its index.

get_chunk_info_by_coord(coordinate_index)[source]

Retrieve information about a chunk specified by the array address of the chunk’s first element in each dimension.

get_chunk_info_from_chunk_coord(chunk_coords)[source]

Retrieve storage information about a chunk specified by its index. This is a pyfive API extension. This index is in chunk space (as used by zarr) and needs to be converted to HDF5 coordinate space. Additionally, if the dataset is not chunked, the storeinfo is returned for the contiguous data as if it were one chunk.
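
The difference between the two coordinate systems can be illustrated with a small calculation. This is a sketch of our reading of the text above, in which the HDF5 coordinate of a chunk is the array index of its first element, as in get_chunk_info_by_coord. It is not pyfive's internal code, and the function name chunk_to_array_coords is illustrative.

```python
def chunk_to_array_coords(chunk_coords, chunk_shape):
    """Convert a chunk-space index to the array index of the chunk's first element."""
    if len(chunk_coords) != len(chunk_shape):
        raise ValueError("chunk_coords and chunk_shape must have the same length")
    # Each chunk-space step along an axis spans one chunk length
    # of array elements along that axis.
    return tuple(c * n for c, n in zip(chunk_coords, chunk_shape))
```

For a dataset with chunk shape (10, 20, 30), the chunk at chunk-space index (2, 0, 1) starts at array index (20, 0, 30).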

get_data(args, fillvalue)[source]

Called by the dataset getitem method.

get_num_chunks()[source]

Return the total number of chunks in the dataset.

get_type()[source]

Return pyfive type of the dataset.

property index

Direct access to the chunk index, if there is one. This is a pyfive API extension.

iter_chunks(args)[source]

Iterate over chunks in a chunked dataset.

The args argument is a (possibly empty) sequence of indices that defines the region to be used. If an empty sequence then the entire dataspace will be used for the iterator.

For each chunk within the given region, the iterator yields a tuple of indices that gives the intersection of the given chunk with the selection area. This can be used to read data in that chunk.
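
As a sketch of how the yielded selections might be consumed, the helper below counts the elements covered by each one. It assumes that every yielded tuple consists of slice objects with explicit integer bounds and unit step, which this reference does not spell out. The helper name elements_per_chunk and the parameter name dataset_id are illustrative.

```python
def elements_per_chunk(dataset_id, region=()):
    """Return (selection, element_count) for every chunk touching `region`."""
    result = []
    for sel in dataset_id.iter_chunks(region):
        count = 1
        for s in sel:
            # Assumption: each entry is a slice with integer start/stop
            # and a step of 1 (or None).
            count *= s.stop - s.start
        result.append((sel, count))
    return result
```

Each selection can also be used to index the owning dataset to read just the data in that chunk, as noted above.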

read_direct_chunk(chunk_position, **kwargs)[source]

Returns a tuple containing the filter_mask and the raw data storing this chunk as bytes. Additional arguments supported by h5py are not supported here.

set_parallelism(thread_count=0, cat_range_allowed=True, btree_parallel=False)

Configure chunk-read parallelism.

thread_count controls POSIX threaded reads via os.pread:

  • 0 disables threaded reads.
  • >0 enables threaded reads with that many workers.
  • The default is 0 (threaded reads off).

cat_range_allowed enables fsspec bulk reads via cat_ranges for compatible non-POSIX file handles. The default is True.

btree_parallel enables parallel reads for b-tree nodes when building the chunk index. Default False.

This is a pyfive API extension. It is recommended to enable it when working with remote files, but it may not be suitable for local files. Hence defaults are that cat_ranges is on (for remote files) and threads are off (for local files).
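
A minimal sketch of switching on threaded reads for one dataset, using only the keyword shown in the signature above. It reaches the DatasetID through the dataset's id attribute. The helper name enable_threaded_reads and its default of four workers are illustrative choices, not recommendations from pyfive.

```python
def enable_threaded_reads(dset, workers=4):
    """Ask pyfive to read chunks of `dset` with `workers` threads (0 turns this off)."""
    if workers < 0:
        raise ValueError("workers must be zero or positive")
    # dset.id is the DatasetID that provides the actual data access methods.
    dset.id.set_parallelism(thread_count=workers)
```

Whether this helps depends on the storage backend, so it is worth timing reads before and after.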

set_pseudo_chunk_size(newsize_MB)[source]

Set pseudo chunking size for contiguous variables. This is a pyfive API extension. The default value is 4 MB, which should be suitable for most applications. For arrays smaller than this value, no pseudo chunking is used. Larger arrays will be accessed in reads of roughly newsize_MB each.
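
To get a feel for the effect of this setting, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate, not pyfive's algorithm. It assumes that 1 MB means 1024 * 1024 bytes and that a contiguous array is read in equal-sized pieces. The function name estimated_reads is illustrative.

```python
import math


def estimated_reads(nbytes, chunk_mb=4, bytes_per_mb=1024 * 1024):
    """Estimate how many reads a contiguous array of `nbytes` bytes needs."""
    limit = chunk_mb * bytes_per_mb
    if nbytes <= limit:
        # Small arrays: no pseudo chunking, so a single read.
        return 1
    # Larger arrays: pieces of roughly `chunk_mb` each.
    return math.ceil(nbytes / limit)
```

Under these assumptions, a 100 MB contiguous array with the default 4 MB setting would need about 25 reads.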

Datatype

class pyfive.Datatype(name, hfile, raw_dtype)[source]

Bases: object

Provides a minimal implementation of an h5py Datatype, suitable for use with enumerations, compounds and other types.

Attributes:

dtype

property dtype

The h5t module

Partial implementation of some of the lower level h5py API, needed to support enumerations, variable length strings, and opaque datatypes.

pyfive.h5t.check_enum_dtype(dt)[source]

If the dtype represents an HDF5 enumerated type, returns the dictionary mapping string names to integer values. Returns None if the dtype does not represent an HDF5 enumerated type.

pyfive.h5t.check_string_dtype(dt)[source]

If the dtype represents a pyfive string, returns a string_info object holding the encoding and the length. The encoding can only be ‘utf-8’. The length will be None for a variable-length string. Returns None if the dtype does not represent a pyfive string.

pyfive.h5t.check_dtype(**kwds)[source]

Check a dtype for h5py special type “hint” information. Only one keyword may be given.

vlen = dtype

If the dtype represents an HDF5 vlen, returns the Python base class. Currently only built-in string vlens (str) are supported. Returns None if the dtype does not represent an HDF5 vlen.

enum = dtype

If the dtype represents an HDF5 enumerated type, returns the dictionary mapping string names to integer values. Returns None if the dtype does not represent an HDF5 enumerated type.

opaque = dtype

If the dtype represents an HDF5 opaque type, returns True. Returns False if the dtype does not represent an HDF5 opaque type.

pyfive.h5t.check_opaque_dtype(dt)[source]

If the dtype represents an HDF5 opaque type, returns True. Returns False if the dtype does not represent an HDF5 opaque type.

class pyfive.h5t.TypeEnumID(raw_dtype)[source]

Bases: TypeID

Used by Datatype to expose the internal structure of an enum datatype. This is instantiated by pyfive using arcane HDF5 structure information, and should not normally be needed by any user code.

Attributes:

dtype

The numpy dtype.

kind

metadata

Methods:

enum_nameof(index)

Determine the name associated with the given value.

enum_valueof(name)

Get the value associated with an enum name.

get_class()

property dtype

The numpy dtype.

enum_nameof(index)[source]

Determine the name associated with the given value.

enum_valueof(name)[source]

Get the value associated with an enum name.

get_class()
property kind
property metadata