API Reference
File
- class pyfive.File(filename: str | BinaryIO | MetadataBufferingWrapper, mode: str = 'r', metadata_buffer_size: int = 1)
Bases: Group
Open an HDF5 file.
In addition to its file-specific methods, the File object inherits the full interface of Group.
File is also a context manager and therefore supports the with statement. Files opened by the class are closed at the end of the with block; file-like objects passed in by the caller are not closed.
- Parameters:
filename (str or file-like) – Name of the file (string), or a file-like object with read and seek methods that behave like those of a Python file object.
mode (str) – File open mode (default: “r”, read-only).
metadata_buffer_size (int) – Size of metadata buffer for S3/remote files in MiB (default: 1MiB). Larger values reduce network calls but use more memory. (This is a pyfive extension for optimizing remote file access, ignored for local files.)
- filename
Name of the file on disk, None if not available.
- Type:
str
- mode
String indicating that the file is open read-only (“r”).
- Type:
str
- userblock_size
Size of the user block in bytes (currently always 0).
- Type:
int
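As a minimal sketch of the context-manager behaviour described above, the helper below opens a file read-only and returns the shape of one dataset. The function name `dataset_shape` and any path or dataset name passed to it are hypothetical; only `pyfive.File`, its `mode` parameter, item access, and the `shape` attribute come from this reference.

```python
def dataset_shape(source, dataset_name):
    """Open `source` read-only and return the shape of one dataset."""
    # Imported lazily so this sketch can be loaded without pyfive installed.
    import pyfive

    # File is a context manager. If `source` is a path, the file that pyfive
    # opened is closed when the block exits. If `source` is a file-like
    # object, it is left open and the caller remains responsible for it.
    with pyfive.File(source, mode="r") as f:
        return f[dataset_name].shape
```

A call might look like `dataset_shape("data.h5", "temperature")`, where both names are placeholders. When passing an already open binary file object instead of a path, close it yourself afterwards.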
Attributes:
- attrs: attrs attribute.
- consolidated_metadata: Returns True if all B-tree nodes for chunked datasets are located before the first chunk in the file.
Methods:
- close(): Close the file.
- get(k[, d])
- get_lazy_view(y): This instantiates the object y, and if it is a chunked dataset, does so without reading the b-tree index.
- items()
- keys()
- values()
- visit(func): Recursively visit all names in the group and subgroups.
- visititems(func[, noindex]): Recursively visit all objects in this group and subgroups.
- property attrs
attrs attribute.
- close()
Close the file.
- property consolidated_metadata: bool
Returns True if all B-tree nodes for chunked datasets are located before the first chunk in the file.
- get(k[, d]) → D[k] if k in D, else d. d defaults to None.
- get_lazy_view(y: object) → object
This instantiates the object y and, if it is a chunked dataset, does so without reading the b-tree index. This is useful for inspecting a variable whose data you do not expect to access. If you know you want to access the data, and in particular if you are going to hand the data to Dask or something similar, you almost certainly want to read the index now, so use x[y] rather than x.get_lazy_view(y).
This is a pyfive extension to the standard h5py API.
- items() → a set-like object providing a view on D's items
- keys() → a set-like object providing a view on D's keys
- values() → an object providing a view on D's values
- visit(func: Callable) → object
Recursively visit all names in the group and subgroups.
func should be a callable with the signature:
func(name) -> None or return value
Returning None continues iteration; returning anything else stops iteration and returns that value from the visit method.
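The early-exit contract of visit can be sketched as follows. `find_first` is a hypothetical helper that works with any object exposing the documented `visit(func)` method; `DemoGroup` is a stand-in written only so the sketch runs without an HDF5 file, and is not pyfive's implementation.

```python
def find_first(group, suffix):
    """Return the first visited name ending in `suffix`, or None."""
    def callback(name):
        # Returning None keeps the traversal going; any other value stops it
        # and becomes the return value of visit().
        return name if name.endswith(suffix) else None
    return group.visit(callback)


class DemoGroup:
    """Hypothetical stand-in that mimics the documented visit() contract."""

    def __init__(self, names):
        self._names = names

    def visit(self, func):
        for name in self._names:
            result = func(name)
            if result is not None:
                return result
        return None


demo = DemoGroup(["model", "model/lat", "model/temperature", "obs/temperature"])
print(find_first(demo, "temperature"))  # model/temperature
```

With a real file, the same helper could be called on an open File or Group object in place of `demo`.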
- visititems(func: Callable, noindex: bool = False) → object
Recursively visit all objects in this group and subgroups.
func should be a callable with the signature:
func(name, object) -> None or return value
Returning None continues iteration; returning anything else stops iteration and returns that value from the visit method.
Passing the optional noindex=True ensures that all operations are lazy not only with respect to data but also with respect to any chunked data indices. This keyword argument is a pyfive extension to the standard h5py API.
Group
- class pyfive.Group(name: str, dataobjects: DataObjects, parent: Group)
Bases: Mapping
An HDF5 Group which may hold attributes, datasets, or other groups.
- attrs
Attributes for this group.
- Type:
dict
- name
Full path to this group.
- Type:
str
- file
File instance where this group resides.
- Type:
File
- parent
Group instance containing this group.
- Type:
Group
Attributes:
- attrs: attrs attribute.
Methods:
- get(k[, d])
- get_lazy_view(y): This instantiates the object y, and if it is a chunked dataset, does so without reading the b-tree index.
- items()
- keys()
- values()
- visit(func): Recursively visit all names in the group and subgroups.
- visititems(func[, noindex]): Recursively visit all objects in this group and subgroups.
- property attrs
attrs attribute.
- get(k[, d]) → D[k] if k in D, else d. d defaults to None.
- get_lazy_view(y: object) → object
This instantiates the object y and, if it is a chunked dataset, does so without reading the b-tree index. This is useful for inspecting a variable whose data you do not expect to access. If you know you want to access the data, and in particular if you are going to hand the data to Dask or something similar, you almost certainly want to read the index now, so use x[y] rather than x.get_lazy_view(y).
This is a pyfive extension to the standard h5py API.
- items() → a set-like object providing a view on D's items
- keys() → a set-like object providing a view on D's keys
- values() → an object providing a view on D's values
- visit(func: Callable) → object
Recursively visit all names in the group and subgroups.
func should be a callable with the signature:
func(name) -> None or return value
Returning None continues iteration; returning anything else stops iteration and returns that value from the visit method.
- visititems(func: Callable, noindex: bool = False) → object
Recursively visit all objects in this group and subgroups.
func should be a callable with the signature:
func(name, object) -> None or return value
Returning None continues iteration; returning anything else stops iteration and returns that value from the visit method.
Passing the optional noindex=True ensures that all operations are lazy not only with respect to data but also with respect to any chunked data indices. This keyword argument is a pyfive extension to the standard h5py API.
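A visititems callback that never stops early can be used to build an inventory. In this sketch, `collect_shapes` is a hypothetical helper, and it assumes that only datasets expose a `shape` attribute. `DemoGroup` is a stand-in that mimics the documented call signature so the example runs without a file; it is not pyfive's implementation.

```python
from types import SimpleNamespace


def collect_shapes(group, noindex=True):
    """Map each dataset path under `group` to its shape."""
    shapes = {}

    def callback(name, obj):
        # Assumption: groups have no `shape`, so only datasets are recorded.
        shape = getattr(obj, "shape", None)
        if shape is not None:
            shapes[name] = shape
        return None  # None means "keep visiting"

    # noindex=True asks for objects without loading chunk indices.
    group.visititems(callback, noindex=noindex)
    return shapes


class DemoGroup:
    """Hypothetical stand-in that mimics the documented visititems() contract."""

    def __init__(self, members):
        self._members = members

    def visititems(self, func, noindex=False):
        for name, obj in self._members:
            result = func(name, obj)
            if result is not None:
                return result
        return None


demo = DemoGroup([
    ("model", SimpleNamespace()),
    ("model/lat", SimpleNamespace(shape=(90,))),
    ("model/temperature", SimpleNamespace(shape=(12, 90, 180))),
])
print(collect_shapes(demo))
```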
Dataset
- class pyfive.Dataset(name: str, datasetid: DatasetID, parent: Group)
Bases: ABC
An HDF5 Dataset containing an n-dimensional array and metadata attributes.
- shape
Dataset dimensions.
- Type:
tuple
- dtype
Dataset’s type.
- Type:
dtype
- size
Total number of elements in the dataset.
- Type:
int
- chunks
Chunk shape, or None if chunked storage is not used.
- Type:
tuple or None
- compression
Compression filter used on dataset. None if compression is not enabled for this dataset.
- Type:
str or None
- compression_opts
Options for the compression filter.
- Type:
dict or None
- scaleoffset
Setting for the HDF5 scale-offset filter, or None if scale-offset compression is not used for this dataset.
- Type:
dict or None
- shuffle
Whether the shuffle filter is applied for this dataset.
- Type:
bool
- fletcher32
Whether Fletcher32 checksumming is enabled for this dataset.
- Type:
bool
- fillvalue
Value indicating uninitialized portions of the dataset. None if no fill value has been defined.
- Type:
float or None
- ndim
Number of dimensions.
- Type:
int
- dims
Dimension scales.
- Type:
None
- attrs
Attributes for this dataset.
- Type:
dict
- name
Full path to this dataset.
- Type:
str
- file
File instance where this dataset resides.
- Type:
File
- parent
Group instance containing this dataset.
- Type:
Group
Methods:
- astype(dtype): Return a context manager which returns data as a particular type.
- iter_chunks([sel])
- len(): Return the size of the first axis.
- read_direct(array[, source_sel, dest_sel]): Read from an HDF5 dataset directly into a NumPy array.
Attributes:
- attrs: attrs attribute.
- chunks: chunks attribute.
- compression: compression attribute.
- compression_opts: compression_opts attribute.
- dims: dims attribute.
- dtype: dtype attribute.
- fillvalue: fillvalue attribute.
- fletcher32: fletcher32 attribute.
- id: This is the DatasetID instance which provides the actual data access methods.
- maxshape: maxshape attribute.
- ndim: number of dimensions.
- scaleoffset: scaleoffset attribute.
- shape: shape attribute.
- shuffle: shuffle attribute.
- size: size attribute.
- value: alias for dataset[()].
- astype(dtype: str) → AstypeContext
Return a context manager which returns data as a particular type.
Conversion is handled by NumPy after reading the data.
- property attrs
attrs attribute.
- property chunks
chunks attribute.
- property compression
compression attribute.
- property compression_opts
compression_opts attribute.
- property dims
dims attribute.
- property dtype
dtype attribute.
- property fillvalue
fillvalue attribute.
- property fletcher32
fletcher32 attribute.
- id
This is the DatasetID instance which provides the actual data access methods.
- iter_chunks(sel=())
- len()
Return the size of the first axis.
- property maxshape
maxshape attribute. (None for unlimited dimensions)
- property ndim
number of dimensions.
- read_direct(array: ndarray, source_sel: None | tuple = None, dest_sel: None | tuple = None) → None
Read from an HDF5 dataset directly into a NumPy array.
This is equivalent to array[dest_sel] = dset[source_sel].
Creation of intermediates is not avoided. This method is provided for compatibility with h5py; it is not efficient.
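The direction of the copy (dataset selection into destination selection) can be illustrated with a small sketch. Plain 1-D Python lists and slice objects stand in for the dataset and the NumPy array so the example has no dependencies; `read_direct_equivalent` is a hypothetical name, and treating a selection of None as "everything" is an assumption.

```python
def read_direct_equivalent(dset, array, source_sel=None, dest_sel=None):
    """Copy dset[source_sel] into array[dest_sel], modifying array in place."""
    # Assumption: a selection of None means "select everything".
    source_sel = slice(None) if source_sel is None else source_sel
    dest_sel = slice(None) if dest_sel is None else dest_sel
    # The right-hand side is materialised first, so an intermediate copy of
    # the selected data exists before it is stored in the destination.
    array[dest_sel] = dset[source_sel]


dset = [10, 11, 12, 13, 14, 15]  # stands in for a 1-D dataset
out = [0] * 6                    # stands in for the destination array
read_direct_equivalent(dset, out, source_sel=slice(2, 5), dest_sel=slice(0, 3))
print(out)  # [12, 13, 14, 0, 0, 0]
```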
- property scaleoffset
scaleoffset attribute.
- property shape
shape attribute.
- property shuffle
shuffle attribute.
- property size
size attribute.
- property value
alias for dataset[()].
DatasetID
- class pyfive.h5d.DatasetID(dataobject: DataObjects, noindex: bool = False, pseudo_chunking_size_MB: int = 4)
Bases: ChunkRead
Implements an “HDF5 dataset identifier”, which, despite the name, actually represents the data of a dataset in a file, and not an identifier. It includes all the low-level methods for working with chunked data, lazily or not.
This class has been deliberately implemented so as to cache all the relevant metadata, so that once you have an instance, it is completely independent of the parent file and can be used efficiently in distributed threads without thread contention for the b-tree etc. This behaviour may differ from h5py, which cannot isolate the dataset access from the parent file access, as both share underlying C structures.
Attributes:
- btree_range: A tuple with the addresses of the first b-tree node for this variable, and the address of the furthest away node (which may not be the last one in the chunk index).
- dtype: Return numpy dtype of the dataset.
- first_chunk: The integer address of the first data chunk for this variable.
- index: Direct access to the chunk index, if there is one.
Methods:
- get_chunk_info(index): Retrieve storage information about a chunk specified by its index.
- get_chunk_info_by_coord(coordinate_index): Retrieve information about a chunk specified by the array address of the chunk’s first element in each dimension.
- get_chunk_info_from_chunk_coord(chunk_coords): Retrieve storage information about a chunk specified by its index.
- get_data(args, fillvalue): Called by the dataset getitem method
- get_num_chunks(): Return total number of chunks in dataset
- get_type(): Return pyfive type of the dataset.
- iter_chunks(args): Iterate over chunks in a chunked dataset.
- read_direct_chunk(chunk_position, **kwargs): Returns a tuple containing the filter_mask and the raw data storing this chunk as bytes.
- set_parallelism([thread_count, ...]): Configure chunk-read parallelism.
- set_pseudo_chunk_size(newsize_MB): Set pseudo chunking size for contiguous variables.
- property btree_range
A tuple with the addresses of the first b-tree node for this variable, and the address of the furthest away node (which may not be the last one in the chunk index). This property may be of use in understanding the read performance of chunked data in object stores.
btree_range is a pyfive API extension.
- property dtype
Return numpy dtype of the dataset.
- property first_chunk
The integer address of the first data chunk for this variable.
This property may be of use in understanding the read performance of chunked data in object stores.
first_chunk is a pyfive API extension.
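One way these two properties might be combined is to check, for a single variable, whether its chunk index sits entirely before its data, which is the layout File.consolidated_metadata describes for the whole file. The helper below is a hypothetical sketch operating on plain integers. It assumes both btree_range entries are byte addresses comparable with first_chunk, and it is not how pyfive computes consolidated_metadata.

```python
def index_precedes_data(btree_range, first_chunk):
    """True if both reported b-tree node addresses lie before the first chunk."""
    first_node, furthest_node = btree_range
    # Compare the larger of the two node addresses with the chunk address.
    return max(first_node, furthest_node) < first_chunk


# Illustrative addresses only, not taken from a real file.
print(index_precedes_data((2048, 8192), 16384))   # True
print(index_precedes_data((2048, 90000), 16384))  # False
```

If the index is interleaved with the data, reading it from an object store is likely to need more separate requests, which is the performance concern these properties help diagnose.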
- get_chunk_info(index)
Retrieve storage information about a chunk specified by its index.
- get_chunk_info_by_coord(coordinate_index)
Retrieve information about a chunk specified by the array address of the chunk’s first element in each dimension.
- get_chunk_info_from_chunk_coord(chunk_coords)
Retrieve storage information about a chunk specified by its index. This is a pyfive API extension. This index is in chunk space (as used by zarr) and needs to be converted to HDF5 coordinate space. Additionally, if the dataset is not chunked, the storeinfo is returned for the contiguous data as if it were one chunk.
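The relationship between the two coordinate systems can be sketched as follows: a chunk-space index counts whole chunks along each axis, while get_chunk_info_by_coord expects the array address of the chunk's first element. `chunk_to_element_coords` is a hypothetical helper, not part of pyfive, and assumes regular chunks of a fixed shape.

```python
def chunk_to_element_coords(chunk_coords, chunk_shape):
    """Convert a chunk-space index to the address of the chunk's first element."""
    if len(chunk_coords) != len(chunk_shape):
        raise ValueError("chunk_coords and chunk_shape must have the same rank")
    # Along each axis, chunk number k starts at element k * chunk_length.
    return tuple(c * s for c, s in zip(chunk_coords, chunk_shape))


# Chunk (2, 0, 1) of a dataset stored in (10, 20, 30) chunks.
print(chunk_to_element_coords((2, 0, 1), (10, 20, 30)))  # (20, 0, 30)
```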
- get_data(args, fillvalue)
Called by the dataset getitem method
- get_num_chunks()
Return total number of chunks in dataset
- get_type()
Return pyfive type of the dataset.
- property index
Direct access to the chunk index, if there is one. This is a pyfive API extension.
- iter_chunks(args)
Iterate over chunks in a chunked dataset.
The args argument is a (possibly empty) sequence of indices that defines the region to be used. If the sequence is empty, the entire dataspace is used for the iterator.
For each chunk within the given region, the iterator yields a tuple of indices that gives the intersection of the given chunk with the selection area. This can be used to read data in that chunk.
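The intersection logic described above can be sketched in pure Python. This is an illustration under stated assumptions, not pyfive's implementation: `iter_chunk_slices` is a hypothetical function, it takes the dataset shape and chunk shape explicitly, and it handles only unit-step slices.

```python
from itertools import product


def iter_chunk_slices(shape, chunks, sel=()):
    """Yield, for each chunk touched by sel, a tuple of slices giving the overlap."""
    # Pad the selection with full slices so it covers every axis.
    sel = tuple(sel) + (slice(None),) * (len(shape) - len(sel))
    per_axis = []
    for s, n, c in zip(sel, shape, chunks):
        start, stop, step = s.indices(n)
        if step != 1:
            raise ValueError("this sketch only handles unit-step slices")
        axis = []
        if stop > start:
            # Visit every chunk number whose extent overlaps [start, stop).
            for k in range(start // c, (stop - 1) // c + 1):
                axis.append(slice(max(start, k * c), min(stop, (k + 1) * c)))
        per_axis.append(axis)
    # The per-axis overlaps combine into one tuple of slices per chunk.
    yield from product(*per_axis)


for region in iter_chunk_slices((10,), (4,), (slice(3, 9),)):
    print(region)
```

Each yielded tuple can be used directly as an index, for example `dset[region]`, to read just the part of the selection that falls in one chunk.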
- read_direct_chunk(chunk_position, **kwargs)
Returns a tuple containing the filter_mask and the raw data storing this chunk as bytes. Additional arguments supported by h5py are not supported here.
- set_parallelism(thread_count=0, cat_range_allowed=True, btree_parallel=False)
Configure chunk-read parallelism.
thread_count controls POSIX threaded reads via os.pread:
  - 0 disables threaded reads
  - >0 enables threaded reads with that many workers
  - Default 4
cat_range_allowed enables fsspec bulk reads via cat_ranges for compatible non-POSIX file handles. Default True.
btree_parallel enables parallel reads for b-tree nodes when building the chunk index. Default False.
This is a pyfive API extension. It is recommended to enable it when working with remote files, but it may not be suitable for local files. Hence the defaults are that cat_ranges is on (for remote files) and threads are off (for local files).
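To illustrate what threaded reads via os.pread mean in general, the sketch below reads several byte ranges from one file descriptor, either sequentially or with a thread pool. It is a generic standard-library illustration, not pyfive's implementation; `read_ranges` and its parameters are hypothetical, and os.pread is only available on POSIX systems.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_ranges(fd, ranges, thread_count=0):
    """Read (offset, length) byte ranges from an open file descriptor."""
    def read_one(byte_range):
        offset, length = byte_range
        # os.pread takes an explicit offset and does not move the file
        # position, so worker threads can share one descriptor.
        return os.pread(fd, length, offset)

    if thread_count <= 0:
        # A thread count of 0 means plain sequential reads.
        return [read_one(r) for r in ranges]
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        # map() returns results in the order of the input ranges.
        return list(pool.map(read_one, ranges))


with tempfile.TemporaryFile() as tmp:
    tmp.write(b"0123456789abcdef")
    tmp.flush()
    chunks = read_ranges(tmp.fileno(), [(0, 4), (4, 4), (10, 6)], thread_count=2)
print(chunks)  # [b'0123', b'4567', b'abcdef']
```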
- set_pseudo_chunk_size(newsize_MB)
Set pseudo chunking size for contiguous variables. This is a pyfive API extension. The default value is 4 MB, which should be suitable for most applications. For arrays smaller than this value, no pseudo chunking is used. Larger arrays will be accessed in reads of roughly newsize_MB.
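A rough estimate of how many reads a contiguous array might need under pseudo chunking can be sketched as below. The function name is hypothetical, and the sketch assumes that 1 MB means 2**20 bytes and that reads are evenly sized; this reference does not state either, so treat the result as an approximation only.

```python
import math

# Assumption: 1 MB = 2**20 bytes. The reference does not define the unit.
BYTES_PER_MB = 1024 * 1024


def estimated_pseudo_chunk_reads(nbytes, size_mb=4):
    """Estimate the number of reads for a contiguous array of nbytes bytes."""
    limit = size_mb * BYTES_PER_MB
    if nbytes <= limit:
        # Small arrays are not pseudo chunked, so a single read is assumed.
        return 1
    return math.ceil(nbytes / limit)


print(estimated_pseudo_chunk_reads(40 * BYTES_PER_MB))             # 10
print(estimated_pseudo_chunk_reads(40 * BYTES_PER_MB, size_mb=8))  # 5
```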
Datatype
The h5t module
Partial implementation of some of the lower level h5py API, needed
to support enumerations, variable length strings, and opaque datatypes.
- pyfive.h5t.check_enum_dtype(dt)
If the dtype represents an HDF5 enumerated type, returns the dictionary mapping string names to integer values. Returns None if the dtype does not represent an HDF5 enumerated type.
- pyfive.h5t.check_string_dtype(dt)
The returned string_info object holds the encoding and the length. The encoding can only be ‘utf-8’. The length will be None for a variable-length string. Returns None if the dtype does not represent a pyfive string.
- pyfive.h5t.check_dtype(**kwds)
Check a dtype for h5py special type “hint” information. Only one keyword may be given.
- vlen = dtype
If the dtype represents an HDF5 vlen, returns the Python base class. Currently only built-in string vlens (str) are supported. Returns None if the dtype does not represent an HDF5 vlen.
- enum = dtype
If the dtype represents an HDF5 enumerated type, returns the dictionary mapping string names to integer values. Returns None if the dtype does not represent an HDF5 enumerated type.
- opaque = dtype
If the dtype represents an HDF5 opaque type, returns True. Returns False if the dtype does not represent an HDF5 opaque type.
- pyfive.h5t.check_opaque_dtype(dt)
If the dtype represents an HDF5 opaque type, returns True. Returns False if the dtype does not represent an HDF5 opaque type.
- class pyfive.h5t.TypeEnumID(raw_dtype)
Bases: TypeID
Used by DataType to expose the internal structure of an enum datatype. This is instantiated by pyfive using arcane HDF5 structure information, and should not normally be needed by any user code.
Attributes:
- dtype: The numpy dtype.
Methods:
- enum_nameof(index): Determine the name associated with the given value.
- enum_valueof(name): Get the value associated with an enum name.
- property dtype
The numpy dtype.
- get_class()
- property kind
- property metadata