.. _usage:

*****
Usage
*****

For the most part, ``pyfive`` is used exactly as one would use ``h5py``, so the documentation for ``h5py`` is also relevant. However, ``pyfive`` has some additional API features and optimisations, which are noted in the section on "Additional API Features".

.. note::

    ``pyfive`` does not implement the entire ``h5py`` API. In particular, there is no support for writing files, and most of the low-level ``h5py`` API is not implemented. That said, if you find a case where read functionality of the high-level ``h5py`` API is not supported by ``pyfive``, please report it as an issue on our GitHub Issues page.

Working with Files
==================

``pyfive`` provides a high-level interface to HDF5 files, similar to ``h5py``. You can open an HDF5 file and access its contents using dictionary-like syntax, and there is support for lazy loading of datasets on both POSIX and S3 filesystems.

Here is a simple example of how to open an HDF5 file and read its contents using ``pyfive``:

.. code-block:: python

    import pyfive

    # Open the file in read-only mode
    with pyfive.File("data.h5", "r") as f:
        # List the top-level groups and datasets
        print("Keys:", list(f.keys()))

        # Access a group
        grp = f["/my_group"]

        # List items inside that group
        print("Items in /my_group:", list(grp.keys()))

        # Access a dataset and inspect its shape and dtype
        dset = grp["my_dataset"]
        print("Shape:", dset.shape)
        print("Data type:", dset.dtype)

        # Read an entire dataset into a NumPy array
        data = dset[...]
        print("First element:", data[0])

In this example:

* ``pyfive.File`` opens the file, returning a file object that behaves like a Python dictionary.
* Groups (``Group`` objects) can be accessed using dictionary-like keys.
* Datasets (``Dataset`` objects) expose attributes like ``shape`` and ``dtype``, which are loaded when you list them, but the data itself is not loaded from storage into NumPy arrays until you access it.
  (Lazy loading is discussed in more detail in the section on "Optimising Access Speed".)

.. note::

    If you are used to working with NetCDF4 files (and perhaps ``netcdf4-python``), the concept of a ``File`` in ``pyfive`` corresponds to a NetCDF4 ``Dataset`` (both are read from an actual file), and the ``HDF5``/``pyfive``/``h5py`` concept of a ``Dataset`` corresponds to a NetCDF ``Variable``. The notion of a group is semantically similar in both cases.

Working with Datasets
=====================

Most of the time, you will access datasets in much the same way as you would with ``h5py``. You can read data from a dataset using slicing, and you can also access the attributes associated with the dataset.

Here is an example:

.. code-block:: python

    import pyfive

    with pyfive.File("data.h5", "r") as f:
        # Access a dataset
        dset = f["/my_group/my_dataset"]

        # Read a slice of the dataset
        data_slice = dset[10:20]
        print("Data slice:", data_slice)

        # Access attributes of the dataset
        print("Attributes:", dset.attrs)

One notable feature of ``pyfive`` is that the variable ``dset`` created above remains usable outside the context manager (i.e. after the ``with`` block). This means you can close the file and still work with the dataset, as long as you instantiated it before closing the file. This is particularly useful when working with large datasets in a parallel environment, where you might want to close the file to free up resources while still needing to access *some* of the data.

.. note::

    This functionality depends on the fact that the attributes and chunk index of the dataset are read when you first access it, so you can continue to use the dataset after closing the file. This is fully lazy (in that no data is read until needed) and thread-safe, and we have tests to ensure that this behaviour works correctly even in multi-threaded scenarios.
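
As a minimal sketch of the parallel use case described above, the helper below reads several slices of a dataset-like object from a thread pool. ``read_slices`` is a hypothetical name and is not part of ``pyfive``; it only assumes that the object supports slicing with ``[]``. A plain list stands in for the dataset so that the snippet runs without a data file.

.. code-block:: python

    from concurrent.futures import ThreadPoolExecutor


    def read_slices(dset, selections, max_workers=4):
        """Read each selection from ``dset`` in a worker thread.

        ``dset`` can be any object that supports ``dset[selection]``.
        Results are returned in the same order as ``selections``.
        """
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map preserves input order, even if reads finish out of order
            return list(pool.map(lambda sel: dset[sel], selections))


    # Stand-in for a dataset so that the sketch runs without a file
    fake_dset = list(range(100))
    parts = read_slices(fake_dset, [slice(0, 10), slice(10, 20)])
    print([len(p) for p in parts])

    # With pyfive, the dataset would be obtained inside the ``with`` block
    # and could then be passed to read_slices after the file is closed:
    #
    #     with pyfive.File("data.h5", "r") as f:
    #         dset = f["/my_group/my_dataset"]
    #     parts = read_slices(dset, [slice(0, 10), slice(10, 20)])

Whether threads actually speed up the reads will depend on the storage backend and the chunking of the file, so treat the pool size as something to tune rather than a recommendation.
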

Using S3/Object Storage
=======================

``pyfive`` is designed to work seamlessly with both local filesystems and S3-compatible object storage (and probably any remote storage that supports the ``fsspec`` API). However, there are some additional considerations when working with S3, the most important of which is the need to use the ``s3fs`` library to provide a filesystem interface to S3.

Here is a simple example of how to open an HDF5 file stored in S3 and read its contents using ``pyfive``:

.. code-block:: python

    import pyfive
    import s3fs

    S3_URL = "https://my-s3-place.ac.uk"
    filename = "filename.nc"
    blocks_MB = 1  # Set the block size for S3 access

    s3params = {
        'endpoint_url': S3_URL,
        'default_fill_cache': False,
        'default_cache_type': "readahead",
        'default_block_size': blocks_MB * 2**20
    }

    fs = s3fs.S3FileSystem(anon=True, **s3params)

    # Now we can open the file using the S3 filesystem
    uri = 's3-bucket/' + filename
    with fs.open(uri, 'rb') as s3file:
        with pyfive.File(s3file, "r") as f:
            dset = f['variable']
            data = dset[10:20]
            print("Data slice from S3:", data)

.. note::

    The ideal ``s3fs`` parameters to use (``s3params``) will depend on what you are actually doing with the file, as discussed in the section on "Optimising Access Speed". The parameters shown above work well for accessing small amounts of data from a large file, but you may need to adjust them for your specific use case.

.. FIXME: Check if the following is still accurate

This example also shows that while it is possible to close the file access context manager and still access the datasets, you will need to ensure that the S3 filesystem is still available.
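
The ``s3params`` dictionary in the S3 example derives ``default_block_size`` from a block size given in MB. As a sketch, that construction can be wrapped in a small helper so that the block size is easy to vary between use cases. ``make_s3params`` is a hypothetical name and is not part of ``pyfive`` or ``s3fs``; the keys and default values are copied from the example above.

.. code-block:: python

    def make_s3params(endpoint_url, blocks_mb=1):
        """Build keyword arguments for ``s3fs.S3FileSystem`` (hypothetical helper)."""
        if blocks_mb <= 0:
            raise ValueError("blocks_mb must be positive")
        return {
            "endpoint_url": endpoint_url,
            "default_fill_cache": False,
            "default_cache_type": "readahead",
            # The block size is given in bytes; 1 MB is taken as 2**20 bytes
            "default_block_size": int(blocks_mb * 2**20),
        }


    s3params = make_s3params("https://my-s3-place.ac.uk", blocks_mb=1)
    print(s3params["default_block_size"])

The resulting dictionary can be passed on as ``s3fs.S3FileSystem(anon=True, **s3params)``, exactly as in the example above.
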