Adding new data
To add new data, the 3-level hierarchy on which AQUA is based (model - exp - source) must be respected, which means that specific files must be created within the catalog. How to create a new source and add new data is documented in the next sections. If you need to add data in a new catalog, please refer to Adding a new catalog before proceeding.
To add your data to AQUA, you have to provide an intake catalog that describes your data and, in particular, its location. This can be done in two different ways: by adding a standard file-based entry (File-based sources) or by adding a source from the FDB (FDB-based sources) through the specific AQUA FDB interface.
A set of pre-existing fixes can be applied to the data, or you can modify or create your own fixes (see Fixer functionalities).
To exploit the regridding functionalities, you will also need to verify that the grid is available in the aqua/core/config/grids folder, or add it (see Grid definitions).
Finally, if the installation was not editable or you created a new catalog, you will need to copy or link the new catalog files into the AQUA configuration folder (see aqua add <catalog>).
File-based sources
File-based sources are added to AQUA through the default interface provided by intake.
Supported files include NetCDF files (such as the ones described in the example below) and other formats such as GRIB or Zarr.
The best way to explain the process is to follow the example of adding some fake dataset to an existing catalog (obs in our example).
Let’s imagine we have a dataset called SST, with yearly data, part of a data collection called NEWSATDATA which we would like to add.
Suppose that the dataset consists of the following:
- 2 netCDF files, each file contains one year of data (/data/path/1990.nc and /data/path/1991.nc)
- data are stored in 2D arrays, with dimensions lat and lon
- coordinate variables are lat and lon, and the time variable is time, all one dimensional
- data located on the LUMI machine
We will create a catalog entry that will describe this dataset.
The “model” name will be NEWSATDATA (we use the “model/exp/source” hierarchy also for observations)
The first step is to add a new entry to the config/catalogs/obs/catalog.yaml file.
The additional entry, to be added in the sources: section, will look like this:
  NEWSATDATA:
    description: amazing NEWSAT collection of SST data
    driver: yaml_file_cat
    args:
      path: "{{CATALOG_DIR}}/NEWSATDATA/main.yaml"
This will create the model entry within the catalog that can be used later by the Reader().
Note
Catalog source files are processed by intake, which means that they can exploit the built-in Jinja2 template replacement, as done in the example above with {{CATALOG_DIR}}.
Then we will need to create an appropriate entry at the exp level, which will be included in the file config/catalogs/lumi/catalog/NEWSATDATA/main.yaml.
For our example we will call this “experiment” SST.
In our case, the main.yaml file will look like this (but many other experiments, corresponding to the same model, can be added in the sources section):
sources:
  SST:
    description: amazing SST dataset from the NEWSATDATA collection
    driver: yaml_file_cat
    args:
      path: "{{CATALOG_DIR}}/SST.yaml"
The final step is to create/edit the file SST.yaml, also to be stored in the config/catalogs/lumi/catalog/NEWSATDATA directory.
The most straightforward intake catalog describing our dataset will look like this:
plugins:
  source:
    - module: intake_xarray

sources:
  annual:
    description: my amazing yearly_SST dataset
    driver: netcdf
    args:
      chunks:
        time: 1
      urlpath:
        - /data/path/1990.nc
        - /data/path/1991.nc
    metadata:
      source_grid_name: lon-lat
      fixer_name: amazing_fixer
Here annual is the source name of the catalog entry.
As for the exp case, we could have multiple sources for the same experiment (for example, monthly data, daily data, etc).
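The way the three files chain together can be sketched in plain Python. This is only an illustration, not how AQUA or intake are implemented: the FILES dictionary stands in for the parsed YAML files, the catalog/ directory name and the resolve helper are hypothetical, and the sketch assumes that {{CATALOG_DIR}} expands to the directory of the catalog file being read.

```python
import posixpath

# In-memory stand-in for the three catalog files, already parsed into
# dictionaries. The "catalog/" directory name is hypothetical.
FILES = {
    "catalog/catalog.yaml": {
        "sources": {
            "NEWSATDATA": {
                "driver": "yaml_file_cat",
                "args": {"path": "{{CATALOG_DIR}}/NEWSATDATA/main.yaml"},
            }
        }
    },
    "catalog/NEWSATDATA/main.yaml": {
        "sources": {
            "SST": {
                "driver": "yaml_file_cat",
                "args": {"path": "{{CATALOG_DIR}}/SST.yaml"},
            }
        }
    },
    "catalog/NEWSATDATA/SST.yaml": {
        "sources": {
            "annual": {
                "driver": "netcdf",
                "args": {"urlpath": ["/data/path/1990.nc", "/data/path/1991.nc"]},
            }
        }
    },
}


def resolve(model, exp, source, root="catalog/catalog.yaml"):
    """Follow model -> exp through the nested catalogs, then return the source entry."""
    current = root
    for key in (model, exp):
        template = FILES[current]["sources"][key]["args"]["path"]
        # Assumption: CATALOG_DIR is the directory of the file being read.
        current = template.replace("{{CATALOG_DIR}}", posixpath.dirname(current))
    return FILES[current]["sources"][source]


entry = resolve("NEWSATDATA", "SST", "annual")
print(entry["driver"], entry["args"]["urlpath"])
```

Each level only needs to know the path of the next file, which is why new experiments or sources can be added without touching the levels above them.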
Once this is defined, we can access our dataset from AQUA with the following command:
from aqua import Reader
reader = Reader(model="NEWSATDATA", exp="SST", source="annual")
data = reader.retrieve()
Finally, the metadata entry contains optional additional information useful to define how to postprocess the data:
- source_grid_name: the grid name defined in aqua-grids.yaml to be used for areas and regridding. If set to null or not specified, the regridder will try to guess the grid based on the data coordinates, but this can be costly and is not always successful. If set to False, the regridder will be disabled and no attempt to guess the grid will be made.
- fixer_name: the name of the fixer defined in the fixes folder.
- deltat (optional): the cumulation window of fluxes in the dataset. This is a fixer option. If not present, the default is 1 second.
- time_coder (optional): xarray builds on pandas, and pandas supports a limited time range because time precision is based on nanoseconds. Despite recent updates to the underlying np.datetime64 class, support for time ranges before 1678 AD and after 2262 AD is very limited. A partial solution is to pass a coarser time_coder, e.g. “s”. If specified, it modifies the time resolution used when decoding dates. Underneath, it is used by the CFDatetimeCoder and it works only for NetCDF sources.
- filter_key (optional): sometimes NetCDF sources are based on many small files, and loading long time series can be extremely slow due to the large number of files. This key is meant to filter files based on information in the filename. Currently only the “year” key is supported, which will filter files based on the years between startdate and enddate (of course, only if “year” is found in the filename). It requires that both startdate and enddate are specified in the Reader() call (it will not work at the .retrieve() level). It works only with NetCDF sources.
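The idea behind the “year” filter can be sketched as follows. This is not the AQUA implementation: filter_files_by_year is a hypothetical helper, the YYYYMMDD date format is an assumption, and so is the choice to drop files whose name contains no 4-digit year.

```python
import re
from datetime import datetime


def filter_files_by_year(paths, startdate, enddate):
    """Keep the files whose name contains a year within [startdate, enddate].

    Hypothetical helper: dates are assumed to be "YYYYMMDD" strings and the
    year is assumed to appear as an isolated 4-digit number in the filename.
    """
    first = datetime.strptime(startdate, "%Y%m%d").year
    last = datetime.strptime(enddate, "%Y%m%d").year
    kept = []
    for path in paths:
        filename = path.rsplit("/", 1)[-1]
        match = re.search(r"(?<!\d)(\d{4})(?!\d)", filename)
        if match and first <= int(match.group(1)) <= last:
            kept.append(path)
    return kept


files = ["/data/path/1990.nc", "/data/path/1991.nc", "/data/path/1992.nc"]
print(filter_files_by_year(files, "19910101", "19921231"))
```

Filtering on the filename avoids opening every file just to discover that it lies outside the requested period.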
You can add fixes to your dataset by following examples in the aqua/core/config/fixes/ directory (see Fixer functionalities).
Note
If you want to add a Zarr or GRIB source the syntax may be slightly different, but the general structure of the catalog will be the same. You can find examples in the existing catalog or more information in the intake and intake-xarray documentation.
FDB-based sources
FDB-based sources are built using a specific interface developed by AQUA. While the procedure for adding the catalog tree entries is the same, the main difference is in how the specific source is described.
We report here an example and we later describe the different elements.
sources:
  hourly-hpz7-atm2d:
    args:
      request:
        class: d1
        dataset: climate-dt
        activity: ScenarioMIP
        experiment: SSP3-7.0
        generation: 1
        model: IFS-NEMO
        realization: 1
        resolution: standard
        expver: '0001'
        type: fc
        stream: clte
        date: 20210101
        time: '0000'
        param: 167
        levtype: sfc
      data_start_date: 20200101T0000
      data_end_date: 20391231T2300
      chunks: D  # Default time chunk size
      savefreq: h  # at what frequency are data saved
      timestep: h  # base timestep for step timestyle
      timestyle: date  # variable date or variable step
    description: hourly 2D atmospheric data on healpix grid (zoom=7, h128).
    driver: gsv
    metadata: &metadata-default
      fdb_home: '{{ FDB_PATH }}'
      fdb_home_bridge: '{{ FDB_PATH }}/databridge'
      variables: [78, 79, 134, 137, 141, 148, 151, 159, 164, 165, 166, 167, 168, 186,
        187, 188, 235, 260048, 8, 9, 144, 146, 147, 169, 175, 176, 177, 178, 179,
        180, 181, 182, 212, 228]
      source_grid_name: hpz7-nested
      fixer_name: ifs-destine-v1
      fdb_info_file: '{{ FDB_PATH }}/0001.yaml'
This is a source entry from the FDB of one of the simulations run with the IFS-NEMO model.
The source name is hourly-hpz7-atm2d, which indicates that the source provides hourly 2D atmospheric data on a HEALPix grid at zoom level 7.
Some of the parameters are described here:
- request
The request entry in the intake catalog primarily serves as a template for making data requests, following the standard MARS-style syntax used by the GSV retriever.
The date parameter will be automatically overwritten by the appropriate data_start_date.
This documentation provides an overview of the key parameters used in the catalog, helping users better understand how to configure their data requests effectively.
- data_start_date
This defines the starting date of the experiment. It must be set because there is no easy way to get this information directly from the FDB. For the schema used in the operational experiments, which use the ‘date’ timestyle (see below), it is possible to set this parameter to auto. In that case the date will be automatically determined from the FDB (available only for local FDB access, not for databridge). Please note that, due to how the date information is retrieved in the auto case, the time of the last date will always be 0000. If there is more data available on the last day, please consider setting the date manually.
- data_end_date
As above, it tells AQUA when to stop reading from the FDB and it can be set to auto too (only if timestyle is ‘date’).
- bridge_start_date
This optional date is used for cases where part of the data are on the HPC FDB and part on the databridge. This is the first date/time for which data are stored on the databridge. Previous data are assumed to be on the HPC. If set to complete then all data are assumed to be on the bridge. If omitted, but bridge_end_date is set, it is assumed to be the same as data_start_date. It can be set to a filename from which to read the date/time (in any format understood by pandas). If set to stac, the DestinE STAC API will be used to get both bridge_start_date and bridge_end_date. Only LUMI Bridge is supported for now.
- bridge_end_date
This optional date is used for cases where part of the data are on the HPC FDB and part on the databridge. This is the last date/time (included) for which data are stored on the databridge. Following data are assumed to be on the HPC. If set to complete then all data are assumed to be on the bridge (equivalent to setting data_end_date to “complete”). If omitted, but bridge_start_date is set, it is assumed to be the same as data_end_date. It can be set to a filename from which to read the date/time (in any format understood by pandas). If set to stac, the DestinE Bridge STAC API will be used to get both bridge_start_date and bridge_end_date. Only LUMI Bridge is supported for now.
- hpc_expver
This optional parameter is used to specify the expver of the data on the HPC FDB. If not set, the expver is assumed to be the same for all data.
- chunks
The chunks parameter is essential, whether you are using Dask or a generator. It determines the size of the chunk loaded in memory at each iteration.
When using a generator, it corresponds to the chunk size loaded into memory during each iteration. For Dask, it controls the size of each chunk used by Dask’s parallel processing.
The choice of the chunks value is crucial as it strikes a balance between memory consumption and distributing enough work to each worker when Dask is utilized with multiple cores. In most cases, the default values in the catalog have been thoughtfully chosen through experimentation.
For instance, a chunks value of D (for daily) works well for hourly-native data because it occupies approximately 1.2GB in memory. Increasing it beyond this limit may lead to memory issues.
It is possible to choose a smaller chunks value, but keep in mind that each worker has its own overhead, and it is usually more efficient to retrieve as much data as possible from the FDB for each worker.
By default the chunks argument is a string and refers to time-chunking. In more advanced cases it is possible to chunk both in time and in the vertical (along levels) by passing a dictionary to chunks with the keys time and vertical. In this case time is, as usual, a time frequency (in pandas notation) and vertical is the maximum number of vertical levels in each chunk.
An example would be:
chunks:
  time: D  # Default time chunk size
  vertical: 3  # Three vertical levels in each chunk
- timestep
The timestep parameter, denoted as H, represents the original frequency of the model’s output.
When timestep is set to H, requesting data at step=6 and step=7 from the FDB will result in a time difference of 1 hour (1H).
This parameter exists because even when dealing with monthly data, it is still stored at steps like 744, 1416, 2160, etc., which correspond to the number of hours since 00:00 on January 1st.
- savefreq
Savefreq, indicated as M for monthly or h for hourly, signifies the actual frequency at which data are available in this stream.
Combining this information with the timestep parameter allows us to anticipate data availability at specific steps, such as 744 and 1416 for monthly data.
- timestyle
The timestyle parameter can be set to either date or yearmonth, according to the FDB schema; it determines how the time axis data is written in the FDB.
The above examples have used date, directly specifying both date and time in the request. When using the yearmonth timestyle instead, you do not have to set either time or date in the request; the year and month keys need to be specified instead. The FDB module will then build the corresponding request.
Please note that it is very important to know which timestyle has been used in the FDB before creating the request.
- timeshift
Timeshift is a boolean parameter used exclusively for shifting the date of monthly data back by one month.
Implementing this correctly in a general case can be quite complex, so it was decided to implement only the monthly shift.
- metadata
This includes important supplementary information:
- fdb_home: the path to where the FDB data are stored
- fdb_path: the path of the FDB configuration file (deprecated, use only if config.yaml is in a non-standard place)
- fdb_home_bridge: FDB_HOME for bridge access
- fdb_path_bridge: the path of the FDB configuration file for bridge access (deprecated, use only if needed)
- variables: a list of variables available in the FDB
- source_grid_name: the grid name defined in aqua-grids.yaml to be used for areas and regridding
- fixer_name: the name of the fixer defined in the fixes folder
- levels: for 3D FDB data with a levelist in the request, this is the list of physical levels (e.g. [0.5, 10, 100, …] meters while levelist contains [1, 2, 3, …])
- deltat (optional): the cumulation window of fluxes in the dataset. This is a fixer option. If not present, the default is 1 second.
- fdb_info_file (optional): the path to the YAML file written by the workflow that can be used to infer data_start_date, data_end_date and other information such as bridge_start_date and bridge_end_date. If not present, default values are used. It consists of two blocks, a data block and a bridge block. The first one contains the information for the entire simulation and is mandatory, while the second one contains the information for the databridge and can be written only if the data are split between the FDB and the databridge.
If the levels key is defined, then retrieving 3D data is greatly accelerated, since only one level of each variable will actually have to be retrieved in order to define the Dataset.
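To make the time and vertical chunking described under chunks more concrete, the sketch below splits a period into daily windows and a list of levels into groups of three. It is only an illustration under stated assumptions: daily_chunks and vertical_chunks are hypothetical helpers (not AQUA functions), the data are assumed to be hourly, and the dates use the same format as data_start_date in the example above.

```python
from datetime import datetime, timedelta


def daily_chunks(start, end):
    """Split [start, end] into daily windows of hourly data.

    Hypothetical helper: returns a list of (first, last) datetimes, one per day.
    """
    fmt = "%Y%m%dT%H%M"
    first, last = datetime.strptime(start, fmt), datetime.strptime(end, fmt)
    chunks = []
    current = first
    while current <= last:
        next_day = datetime(current.year, current.month, current.day) + timedelta(days=1)
        # The last hourly timestamp of the day, capped at the end of the period.
        chunk_end = min(next_day - timedelta(hours=1), last)
        chunks.append((current, chunk_end))
        current = next_day
    return chunks


def vertical_chunks(levels, size):
    """Group vertical levels so that each chunk holds at most `size` levels."""
    return [levels[i:i + size] for i in range(0, len(levels), size)]


print(len(daily_chunks("20200101T0000", "20200103T2300")))
print(vertical_chunks([1, 2, 3, 4, 5, 6, 7], 3))
```

With time: D and vertical: 3, each chunk would then correspond to one daily window combined with one group of at most three levels.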
Warning
For FDB sources the metadata section contains very important information that is used to retrieve the correct variables and levels.
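The step values quoted for timestep and savefreq (744, 1416, 2160) can be checked with a few lines of standard-library Python. The monthly_steps helper is hypothetical and assumes an hourly timestep and a run starting at 00:00 on January 1st of a non-leap year.

```python
from datetime import datetime


def monthly_steps(start, n_months):
    """Hours elapsed from `start` to the beginning of each following month."""
    steps = []
    year, month = start.year, start.month
    for _ in range(n_months):
        month += 1
        if month > 12:
            year, month = year + 1, 1
        boundary = datetime(year, month, 1)
        steps.append(int((boundary - start).total_seconds() // 3600))
    return steps


print(monthly_steps(datetime(1990, 1, 1), 3))  # [744, 1416, 2160]
```

The steps are not evenly spaced because months have different lengths, which is why savefreq has to be combined with timestep to predict where monthly data are stored.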
Experiment metadata
It is highly recommended (but optional) to provide additional metadata for each experiment in the main.yaml file.
This information is particularly useful to document aspects of experiments such as resolution, forcing type, autosubmit expid, etc.
These details are later used by the AQUA Dashboard for visualization of model results.
This can be done with an additional metadata key in the main.yaml file, as shown below:
sources:
  historical-1990:
    description: IFS-NEMO, historical 1990, tco1279/eORCA12 (a0h3)
    metadata:
      expid: a0h3
      resolution_atm: tco1279
      resolution_oce: eORCA12
      forcing: historical
      start: 1990
      dashboard:
        menu: historical 1990
        resolution_id: SR
    driver: yaml_file_cat
    args:
      path: '{{CATALOG_DIR}}/historical-1990.yaml'
All keys are optional and others can be freely added; the following are recommended:
- expid: the autosubmit expid of the experiment, useful to uniquely identify it.
- resolution_atm: the atmospheric resolution of the experiment.
- resolution_oce: the oceanic resolution of the experiment.
- forcing: the forcing type of the experiment (examples are “historical”, “scenario ssp370”, etc).
- start: the starting year of the experiment.
- dashboard: a dictionary with additional information for the dashboard/aqua-web:
  - menu: the name of the experiment as it will appear in the dashboard menu.
  - resolution_id: a short string to identify the resolution of the experiment in the dashboard (LR, MR, SR, HR). This is an internal classification for aqua-web. Our convention is LR=about 144 km, MR=about 36 km, SR=about 25 km, SR=about 10 km, HR=about 5 km.
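Since all these keys are optional, it can be handy to check which recommended ones are missing before publishing a catalog. The following is a minimal sketch on a plain dictionary; missing_recommended is a hypothetical helper and not part of AQUA.

```python
# Recommended keys as listed above; the helper itself is hypothetical.
RECOMMENDED_KEYS = ("expid", "resolution_atm", "resolution_oce", "forcing", "start", "dashboard")
DASHBOARD_KEYS = ("menu", "resolution_id")


def missing_recommended(metadata):
    """Return the recommended metadata keys that are absent from `metadata`."""
    missing = [key for key in RECOMMENDED_KEYS if key not in metadata]
    dashboard = metadata.get("dashboard", {})
    missing += [f"dashboard.{key}" for key in DASHBOARD_KEYS if key not in dashboard]
    return missing


metadata = {
    "expid": "a0h3",
    "resolution_atm": "tco1279",
    "resolution_oce": "eORCA12",
    "forcing": "historical",
    "start": 1990,
    "dashboard": {"menu": "historical 1990", "resolution_id": "SR"},
}
print(missing_recommended(metadata))
print(missing_recommended({"expid": "a0h3", "dashboard": {"menu": "test"}}))
```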
Regridding capabilities
In order to make use of the AQUA regridding capabilities we need to specify which grid each source is defined on.
AQUA is shipped with multiple grid definitions, which are stored in the aqua/core/config/aqua-grids.yaml file.
In the following paragraphs we will describe how to define a new grid if needed.
Once the grid is defined, you can come back to this section to understand how to use it for your source.
Let’s imagine that for our yearly_SST source we want to use the lon-lat grid, which is defined in the aqua/core/config/aqua-grids.yaml file and consists of a regular lon-lat grid.
In our case, we will need to set source_grid_name in the metadata of the yearly_SST.yaml file, as follows:
yearly_SST:
  description: amazing yearly_SST dataset
  driver: yaml_file_cat
  args:
    path: "{{CATALOG_DIR}}/yearly_SST/main.yaml"
  metadata:
    source_grid_name: lon-lat
Grid definitions
As mentioned above, AQUA has some predefined grids available in the aqua/core/config/grids folder.
Below we provide some information on the grid keys, so that new grids can be defined.
As an example, we use the healpix grid for ICON and tco1279 for IFS:
icon-healpix:
  path:
    2d: '{{grids}}/HealPix/icon_hpx{zoom}_atm_2d.nc'  # this is the default 2d grid
    2dm: '{{grids}}/HealPix/icon_hpx{zoom}_oce_2d.nc'  # this is an additional and optional 2d grid used if data are masked
    depth_full: '{{grids}}/HealPix/icon_hpx{zoom}_oce_depth_full.nc'
    depth_half: '{{grids}}/HealPix/icon_hpx{zoom}_oce_depth_half.nc'
  masked:  # This is the attribute used to distinguish variables which should go into the masked category
    component: ocean
  space_coord: ["cell"]
tco1279:
  path:
    2d: '{{grids}}/IFS/tco1279_grid.nc'
    2dm: '{{grids}}/IFS/tco1279_grid_masked.nc'
  masked_vars: ["ci", "sst"]
Note
Two kinds of template replacement are available in the files contained in the aqua/core/config/grids folder.
The Jinja formatting {{ var }} is used to set variables, such as paths, that come from the catalog.yaml file.
The default python formatting {} is used for the file structure, which comes from Reader arguments such as model, experiment or any other kwargs the user might set.
Please pay attention to which one you are using in your files.
In the future we will try to harmonize this towards the Jinja formatting.
- path: Path to the grid data file. It can be a single file if the grid is 2d, but can include multiple files as a function of the grid used. 2d refers to the default grid, 2dm to the grid for masked variables; any other key refers to specific 3d vertical masked structures, such as depth_full, depth_half, level, etc.
- space_coord: The space coordinates, i.e. how coordinates are defined and used for interpolation. There is an automatic guessing routine, but it is somewhat costly, so it is better to specify this if possible.
- masked (if applicable): Keys to define which variables are masked. When using this, the code will search for an attribute to make the distinction (component: ocean in this case). Alternatively, if you want to apply masking only to a group of variables, you can define vars: [var1, var2]. In all cases, the 2dm grid will be applied to the data.
- cdo_extra (if applicable): Additional CDO command to be used to process the files defined in path.
- cdo_options (if applicable): Additional CDO options to be used to process the files defined in path.
- cellareas, cellareas_var (if applicable): Optional path and variable name specifying a file from which to retrieve the grid cell areas, for when the grid shape is too complex to be automatically computed by CDO.
- regrid_method (if applicable): Alternative CDO regridding method to use instead of the ycon default. To be used when grid corners are not available. Alternatives might be bil, bic or nn.
Other simpler grids can be defined using the CDO syntax, so for example we have r100: r360x180.
Further CDO compatible grids can be of course defined in this way.
A standard lon-lat grid is defined for basic interpolation and can be used for most of the regular cases,
as long as the space_coord are lon and lat.
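The interplay between the two template styles mentioned in the note above can be sketched with the standard library only. This is an illustration, not the AQUA code: render is a hypothetical helper, the Jinja step is imitated with a regular expression rather than the Jinja2 library, and /path/to/grids is a made-up value.

```python
import re


def render(template, jinja_vars, format_kwargs):
    """Resolve {{ name }} placeholders first, then Python-style {name} ones."""

    def substitute(match):
        return str(jinja_vars[match.group(1)])

    # Stage 1: Jinja-style placeholders (values coming from catalog.yaml).
    stage1 = re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)
    # Stage 2: Python-style placeholders (values coming from Reader kwargs).
    return stage1.format(**format_kwargs)


template = "{{grids}}/HealPix/icon_hpx{zoom}_atm_2d.nc"
print(render(template, {"grids": "/path/to/grids"}, {"zoom": 7}))
# /path/to/grids/HealPix/icon_hpx7_atm_2d.nc
```

The order matters: calling str.format first would treat {{grids}} as escaped braces and turn it into the literal {grids}, which is one reason to pay attention to which style is used where.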
Compact catalogs with YAML override
In order to avoid writing the same catalog entry for each source, AQUA can use the YAML override functionality for intake catalogs as well. This allows the full request information to be written only for a first base catalog source, with the following ones defined as copies of the first, overriding only the keys that are different.
For example, let’s imagine that we have a first source called hourly-native that is defined as:
sources:
  hourly-native: &base-default
    description: hourly data on native grid TCo1279 (about 10km).
    args: &args-default
      request: &request-default
        class: d1
        resolution: high
        [ ... other request parameters ... ]
      data_start_date: 19900101T0000
      data_end_date: 19941231T2300
      chunks: D
      [ ... other keys ... ]
    metadata: &metadata-default
      fdb_path: [ ... some path to the FDB ... ]
      eccodes_path: [ ... some path to the eccodes ... ]
      [ ... other keys ... ]
We can then define a second source as a copy of the first one, specifying only what is different:
  hourly-r025:
    <<: *base-default
    description: hourly 2D atmospheric data on regular r025 grid (1440x721).
    args:
      <<: *args-default
      request:
        <<: *request-default
        resolution: standard
    metadata:
      <<: *metadata-default
      fdb_path: [ ... some different path to the FDB ... ]
This second source will have the same keys as the first one, except for the ones that are explicitly overridden.
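The YAML merge key (<<) performs a shallow merge, which is why the example repeats it at the args, request and metadata levels. The behaviour can be mimicked with Python dictionaries, where {**base, key: value} plays the role of the merge followed by an override. The dictionaries below are a reduced, illustrative version of the example above.

```python
base_request = {"class": "d1", "resolution": "high"}
base_args = {
    "request": base_request,
    "data_start_date": "19900101T0000",
    "chunks": "D",
}

# Equivalent of `<<: *args-default` plus a `request:` block that itself
# starts with `<<: *request-default`: only `resolution` changes.
derived_args = {
    **base_args,
    "request": {**base_request, "resolution": "standard"},
}

# Without the inner merge, the whole nested `request` mapping is replaced
# and the `class` key is lost.
shallow_args = {**base_args, "request": {"resolution": "standard"}}

print(derived_args["request"])
print(shallow_args["request"])
```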
Intake capabilities and kwargs data access
Intake ships template replacement capabilities based on Jinja2, which can “compress” multiple sources into one. This is combined with AQUA’s capacity to process extra arguments that go beyond the classical model-exp-source hierarchy. For example, assume we have an FDB source like the one above. However, this source is made up of multiple ensemble members, and we want to describe this in the catalog. This is something intake can easily handle with the Jinja {{ }} syntax.
sources:
  hourly-native:
    args:
      request:
        domain: g
        class: rd
        expver: a06x
        realization: '{{ realization }}'
        ...
    driver: gsv
    parameters:
      realization:
        allowed: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
        description: realization member
        type: int
        default: 1
This can later be accessed via the reader by providing an extra argument (a kwarg, in Python jargon) which defines the realization:
from aqua import Reader
reader = Reader(model="IFS", exp="control-1950-devcon", source="hourly-native", realization=5)
data = reader.retrieve(var='2t')
This will load the realization number 5 of my experiment above. Of course, if we do not specify the realization in the Reader() call a default will be provided, so in the case above the number 1 will be loaded.
This capacity can be tuned to multiple features according to source characteristics, and will be further expanded in the future.
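The role of the default and allowed keys of a parameter can be sketched as follows. This only illustrates the logic described above and is not intake's implementation; resolve_parameter is a hypothetical helper.

```python
# Parameter specification copied from the catalog example above.
REALIZATION_SPEC = {
    "allowed": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "description": "realization member",
    "type": "int",
    "default": 1,
}

# Minimal mapping from the catalog type names to Python casters (assumption).
CASTERS = {"int": int, "float": float, "str": str}


def resolve_parameter(spec, value=None):
    """Return the value to use for a parameter, falling back to its default."""
    if value is None:
        return spec["default"]
    value = CASTERS[spec["type"]](value)
    if "allowed" in spec and value not in spec["allowed"]:
        raise ValueError(f"{value!r} is not one of the allowed values {spec['allowed']}")
    return value


print(resolve_parameter(REALIZATION_SPEC))     # no kwarg given: the default
print(resolve_parameter(REALIZATION_SPEC, 5))  # realization=5 requested
```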
Warning
Some kwargs might have an impact on the resolution of the data, and consequently on the grid file name and format. An example is the zoom key used for some ICON data.
In this case, AQUA will modify the file templates accordingly. Whether this modification is required can be controlled through the variable default_weights_areas_parameters in the reader.py module.
This is a test feature and will be expanded in the future.
DE_340 source syntax convention
Although any combination of model-exp-source can be defined in each catalog to get access to the data, inside DE_340 a series of decisions has been taken to try to homogenize the definition of experiments and sources. We decided to use the dash (-) to connect the different elements of the syntax below.
Models (model key)
This will simply be one of the three coupled models used in the project: IFS-NEMO, IFS-FESOM and ICON. Since version v0.5.2, catalog entries for the coupled models have been available, though only on LUMI. Analysing specific atmosphere-only or ocean-only runs will still be possible.
Experiments (exp key)
Considering that we have a strict set of experiments that must be produced, we will follow this 3-string convention:
- Experiment kind: historical, control, sspXXX
- Starting year: 1950, 1990, etc.
- Extra info (optional): any information that might be important to define an experiment, such as dev, test, the expid of the simulation, or anything else that can help to define the experiment.
Examples are historical-1990-dev or control-1950-dev. For test experiments, we simply use the expid of the experiment.
Sources (source key)
For the sources, we decided to harmonize the different requirements of grids and temporal resolution:
- Domain: oceanic sources will have oce prepended to all their sources.
- Time resolution: monthly, daily, 6hourly, hourly, etc.
- Space resolution: native, 1deg, 025deg, r100, etc. For some oceanic models we could add the horizontal grid, so native-elem or native-gridT could be an option. Similarly, if multiple healpix grids are present, they can be healpix-0 or healpix-6 in case we want to specify the zoom level.
- Extra info: 2d or 3d. Not mandatory, but to be used when confusion might arise.
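The naming convention can be summarised with two small helpers that join the elements with a dash. Both functions are hypothetical and only illustrate the convention; in particular, placing the oce prefix as a first dash-separated element is an assumption.

```python
def exp_name(kind, start_year, extra=None):
    """Build an experiment name: kind, starting year and optional extra info."""
    parts = [kind, str(start_year)]
    if extra:
        parts.append(extra)
    return "-".join(parts)


def source_name(time_res, space_res, extra=None, oceanic=False):
    """Build a source name: optional oce prefix, time and space resolution, extra info."""
    parts = ["oce"] if oceanic else []  # assumption: prefix joined with a dash
    parts += [time_res, space_res]
    if extra:
        parts.append(extra)
    return "-".join(parts)


print(exp_name("historical", 1990, "dev"))  # historical-1990-dev
print(source_name("hourly", "native"))      # hourly-native
```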