The Reader class
Here we describe the main component of the AQUA core, the Reader class.
The other main core components are described in the Regrid and interpolation capabilities, Fixer functionalities and Other core functionalities sections.
Some extra functionalities can be found in the Advanced Topics section.
The Reader
The Reader class provides AQUA access to data, developed to offer a centralized common data access point.
AQUA Reader can, in fact, access different file formats and data from the FDB or intake catalogs,
and delivers xarray objects.
On top of data access, the Reader is also able to perform multiple operations on the data:
interpolation and regridding (see Regrid and interpolation capabilities), spatial and temporal averaging and metadata correction (see Fixer functionalities).
These operations can be both embedded in the Reader class or used by initializating separate components.
The Reader class is also able to perform parallel processing and to stream data,
since high-resolution data can be too large to be loaded in memory all at once
and it may be necessary to process data in chunks or even step by step.
Input and Output formats
AQUA supports a variety of climate data file input formats:
NetCDF
GRIB files
Zarr
FDB GRIB
ARCO files
Parquet
After the data are retrieved, the Reader class returns an xarray object,
specifically an xarray.Dataset, where only the metadata are loaded in memory.
Note
Since metadata are the minimum information needed to load the data and prepare the processing, large sets of numerous NetCDF files are easy to read, but they may require to open a large amount of data to be able to check all the metadata. We then suggest, if low performance is experienced, to use the Zarr format on top of the NetCDF format, to significantly improve the performance of the data access.
Catalog exploration
For an extensive catalog exploration, you can use the show_catalog_content() function.
This function scans catalog(s) by reading YAML files directly and displays the model/exp/source structure.
The simplest way to use it is:
from aqua import show_catalog_content
results = show_catalog_content()
This will scan all available catalogs and output at the info level a dictionary with catalog names and nested values. You can also filter by specific catalog(s), model, experiment, or source:
# Scan specific catalog(s)
results = show_catalog_content(catalog=['ci','obs'])
# Filter by model
results = show_catalog_content(model='IFS-NEMO')
By default, show_catalog_content() displays sources in a compact 3-column format for quick overview.
You can enable the show_descriptions=True parameter to display each source on its own line with its description
from the catalog entry, providing additional information:
# Show catalog with detailed descriptions
catalog_content = show_catalog_content(model='IFS-NEMO', show_descriptions=True)
Note
The show_catalog_content() function is a convenience wrapper that handles ConfigPath initialization internally.
If you need more control over the configuration, you can still use the method directly from the ConfigPath class:
from aqua.core.configurer import ConfigPath
config = ConfigPath(loglevel='info')
results = config.show_catalog_content()
Reader basic usage
Once you know which dataset you want to access, a call to the Reader can be done.
The basic call to the Reader is:
from aqua import Reader
reader = Reader(model='IFS-NEMO', exp='historical-1990', source='lra-r100-monthly')
data = reader.retrieve()
This will return a Reader object that can be used to access the data.
The retrieve() method will return an xarray.Dataset to be used for further processing.
The catalog, differently from the model, exp and source arguments, is optional.
However, if the triplet is not unique across catalogs, the Reader will guess the correct catalog,
so it is suggested to always specify it when possible.
The basic call enables fixer, area and time average functionalities, but no regridding or streaming. To have a complete overview of the available options, please check the API Reference.
If some information about the data is needed, it is possible to use the info() method of the Reader class.
The retrieve() method accepts further arguments to select only specific variables or vertical levels as in this example:
from aqua import Reader
reader = Reader(model="ERA5", exp="era5-hpz3", source="monthly")
data = reader.retrieve(var=["t", "2t"], level=[50000, 70000])
Warning
Every Reader instance carries information about the grids and fixes of the retrieved data.
If you’re retrieving data from many sources, please instantiate a new Reader for each source.
Note
The Reader class will try to convert the time coordinate to a Gregorian calendar if time is present,
keeping the time units to microsecond precision in order to keep the axis as datetime64[us].
Reader and Intake parameters
Sources can have additional parameters defined in the intake catalog.
For example the same experiment could have multiple ensemble members.
A parameter realization could be then defined in the catalog to distinguish between different ensemble members.
To pass these additional parameters to the Reader, you can use the **kwargs functionality.
For example, to select a specific ensemble member, you can do:
from aqua import Reader
reader = Reader(catalog='ci', model='CMCC', exp='historical', source='monthly-atm', realization=1)
data = reader.retrieve()
See Intake documentation for more details about Intake source parameters.
Dask and streaming capabilities
The standard usage of the Reader class will load metadata in memory and
make the data available for processing.
This is the standard behaviour of the Reader class, where xarray and dask
capabilities are used to retrieve the data.
This allows to fully process also large datasets using dask lazy and parallel processing capabilities.
However, for specific testing or development needs,
the Reader class is also able to allow a streaming of data,
where the data are loaded in chunks and processed step by step.
Please check the Streaming of data section for more details.
Note
Dask access to data is available also for FDB data. Since a specific intake driver has been developed, if you’re adding new FDB sources to the catalog, we suggest to read the Dask access to FDB or GSV section.