Other core functionalities

Time Statistics

Input data may not be available at the desired time frequency. It is possible to perform time statistics, including time averaging, minimum, maximum and standard deviation at a given time frequency by using the Timstat() class and its method timstat()` which allow for several statistical operations. The class is nested into the Reader, and its method are exposed so that is sufficient to use timstat() and its sibilings timmean(), timmin(), timmax(), timsum(), timfirst(), timlast() and timstd(), as in the case below.

from aqua import Reader
reader = Reader(model="IFS", exp="tco2559-ng5", source="ICMGG_atm2d")
data = reader.retrieve()
daily = reader.timmean(data, freq='daily')
# alternatively: daily = reader.timstat(data, stat='mean', freq='daily')

Data have now been averaged at the desired daily timescale. Similarly operations can be performed with others methods. timfirst() and timlast() can be used to select the first and last record in each group after resampling the data at a given frequency.

Note

The TimStat() class supposes that the input data has a time coordinate and converts it to a Gregorian calendar if needed.

Warning

If you do not specify the freq argument, the statistical operation will be operated on the entire dataset!

Some extra options are available:

exclude_incomplete=True: this flag will remove averaged chunks which are not complete (for example, verify that all the record from each month are available before doing the time mean).
center_time=True: this flag will center the time coordinate on the mean time window. Otherwise, the time coordinate will be the first timestamp of the time window.
time_bounds=True: this flag can be activated to build time bounds in a similar way to CMOR-like standard.

The timhist()``method is also available as a method of the ``Reader() class, passsing through the TimStat() class, so that it is easy to compute histograms on time-resampled data:

hist = reader.timhist(data['t2m'], freq="1D", bins=100, range=(250, 350), units='K')

or

hist = data['t2m'].aqua.timhist(freq="1D", bins=100, range=(250, 350), units='K')

When no time frequency information is provided, this method operates on the full time series, providing results identical to the histogram() method. See the histogram() section below for more details on the available options.

Warning

We are aware of issues when using data with non nanosecond time resolution (not datetime64[ns]) in combination with the extra options of the timstat methods. We suggest to switch them off when using such data until further notice.

Detrend

For some analysis, computing or removing a linear (or polynominial) trend can be helpful to highlight the internal variability. The detrend method can be used as a high-level wrapper of xarray polyfit functionalities to achieve this goal.

from aqua import Reader
reader = Reader(model="IFS", exp="tco2559-ng5", source="ICMGG_atm2d")
data = reader.retrieve()
daily = reader.detrend(data['2t'], dim='time')

In this way, linear trend is removed from each grid point of the original dataset along the time dimension. Other dimension can be targeted too, although with limited physical meaning. Of course, it can be used in collaboration with temporal and spatial averaging. Higher order polynominial fits are available too.

Some options includes:

degree: this will define with an integer the order of the polynominial fit. Default is 1, i.e. linear detrending.
skipna=True: removing the NaN from the fit. Default is True.

Warning

Trend and detrend might lead to incorrect results if there is not an equal amount of time elements (e.g. same amount of months or days) in the dataset.

It is also possible to evaluate the coefficients of the fit by calling the trender.coeffs() method. This will call the coeffs() method of the Trender() class, which is used internally by the detrend() method. A dataarray with the coefficients will be returned, with the same dimensions as the original data.

Spatial Selection

The AreaSelection() class, part of the aqua.core.fldstat module, allows to select a specific region of the domain based on latitude and longitude limits or based on predefined regions from the regionmask package. The select_area() method can be used to perform the selection on a DataArray or Dataset.

It is possible to consider or drop the limits of the selection by setting the box_brd flag to True or False respectively. It is also possible to drop the NaN values after the selection by setting the drop flag to True.

The class is nested into the Reader() class, so it is possible to call the select_area() method directly from the reader instance or as aqua accessor.

An example of usage with regionmask predefined regions is:

from aqua import Reader
import regionmask

reader = Reader(catalog='ci', model='ERA5', exp='era5-hpz3', source='monthly', regrid='r100', loglevel='INFO')
data = reader.retrieve()
data = reader.regrid(data)

region = regionmask.defined_regions.natural_earth_v5_0_0.countries_110
selected = reader.select_area(data['2t'], region=region, region_sel=['United States of America', 'Russia'])
selected.isel(time=0).plot()

Note

When selecting a region that crosses the Greenwich meridian (e.g. lon_limits=[350, 10]), the method will automatically convert longitudes to the -180 to 180 range for the selection, in order to allow flawless plotting. The option can be disabled by setting the to_180 flag to False.

Warning

In order to apply an area selection the data Xarray must include lon and lat as coordinates. It can work also on unstructured grids, but information on coordinates must be available. If the dataset does not include these coordinates, this can be achieved with the fixer described in the Fixer functionalities section.

Spatial Statistics

The FldStat() class and its method fldstat() are used to do spatial operations and similary as for TimStat() does for time. Statistical operations can be area-weighted if the class is initialized with an xarray dataset containing the areas of the corresponding grid.

The class is nested into the Reader(), which computes/load the areas of the corresponding source at the initialization. Thus when calling for example reader.fldmean() method area-weighted spatial averaging will be performed. The methods are exposed so that is sufficient to use fldstat(data, stat=statname) with statname being a string such as: mean, min, max, sum, std, integral, areasum; Otherwise the relative sibilings can be called fldmean(), fldmin(), fldmax(), fldsum(), fldstd(), fldintg(), fldarea(). For example, if we run the following commands:

from aqua import Reader
reader = Reader(model="PSC", exp="PIOMAS", source="monthly", regrid="r100", loglevel='info')
data = reader.retrieve()
regrid_sithick = reader.regrid(data['sithick'])
global_thick_mean = reader.fldmean(regrid_sithick)
# alternatively: global_thick_mean = reader.fldstat(regrid_sithick, stat='mean')

we get a time series of the global average sithick.

It is also possible to apply a regional section to the domain before performing the averaging. This will internally use the AreaSelection() class described in the Spatial Selection section.

tprate = data.tprate
global_mean = reader.fldmean(tprate, lon_limits=[-50, 50], lat_limits=[-10,20])

Warning

In order to apply an area selection the data Xarray must include lon and lat as coordinates. It can work also on unstructured grids, but information on coordinates must be available. If the dataset does not include these coordinates, this can be achieved with the fixer described in the Fixer functionalities section.

Also, if you do not specify the dims argument (e.g. dims=['lon']), the statistical operation will be operated on both the (automatically found) horizontal dimensions of the dataset!

The same field statistics methods are also available through the aqua accessor on xarray objects, mirroring the behaviour of the Reader methods. For example:

global_thick_mean = regrid_sithick.aqua.fldmean()
# or with an explicit choice of statistic and regional selection:
regional_sum = regrid_sithick.aqua.fldstat(stat='sum', lon_limits=[-50, 50], lat_limits=[-10, 20])

Histogram

The histogram() function is a simple utility to compute histograms of input data, working with a Xarray DataArray in input. The syntax is similar to the numpy.histogram() function, but it returns a DataArray with the histogram. If the input DataArray is Dask-based then the computation will be lazy. If a Dataset is passed, the first variable will be used. Latitudinal weighting is activated by default, so the ‘counts’ of the histogram will not be integers. This can be overridden by setting the weighted=False argument or by passing a weights DataArray to the weights argument. It is possible to compute a complete PDF (Probability Density Function) by setting the density=True argument.

from aqua import histogram, Reader

reader = Reader(model="IFS", exp="tco2559-ng5", source="ICMGG_atm2d")
data = reader.retrieve()
hist = histogram(data['t2m'], bins=100, range=(250, 350), units='K')

This will return a DataArray with the histogram of the t2m variable, with 100 bins ranging from 250 to 350, with area weighting (on by default) after converting the units to ‘K’ (if needed) In the output DataArray the center_of_bin and width coordinates contains the center and width of each bin respectively. The bins and range arguments are recommended, all others totally optional.

Some extra options are available:

weights=xr.DataArray: this will pass a weights DataArray to the histogram computation.
weighted=False: this will switch off computing a weighted histogram, where the weights are the cosines of the latitudes (true by default).
density=True: this will compute a PDF (Probability Density Function) instead of a histogram (false by default).
dask=False: this will force the computation of the histogram using numpy (true by default).
check=True: this will perform a test to verify that the sum of the counts is equal to the number of elements in the input data.
It will fail if not appropriate bounds are used for the classes. Can be only used if the density flag is False. It will force a computation of the histogram and a numpy array will be returned.

Time selection

Even if slicing your data after the retrieve() method is an easy task, being able to perform a time selecetion during the Reader initialization can speed up your code, having less metadata to explore. For this reason startdate and enddate options are available both during the Reader initialization and the retrieve() method to subselect immediatly only a chunck of data.

Note

If you’re streaming data check the section Streaming of data to have an overview of the behaviour of the Reader with these options.

Level selection

Similarly to Time selection, level selection is a trivial operation, but when dealing with high-resolution 3D datasets, only ask for the required levels can speed up the retrieve process.

When reading 3D data it is possible to specify already during retrieve() which levels to select using the level keyword. The levels are specified in the same units as they are stored in the archive (for example in hPa for atmospheric IFS data, but an index for NEMO data in the FDB archive).

Note

In the case of FDB data this presents the great advantage that a significantly reduced request will be read from the FDB (by default all levels would be read for each timestep even if later a sel() or isel() selection is performed on the XArray).

Warning

If you’re dealing with level selection and regridding, please take a look at the section Level selection and regridding.

Streaming of data

The Reader class includes the ability to simulate data streaming to retrieve chunks of data of a specific time length.

To activate the streaming mode the user should specify the argument streaming=True in the Reader initialization. The user can also choose the length of the data chunk with the aggregation keyword (e.g. in pandas notation, or with aliases as daily, monthly etc. or days, months etc.). The default is S (step), i.e. single saved timesteps are read at each iteration. The user can also specify the desired initial and final dates with the keywords startdate and enddate.

If, for example, we want to stream the data every three days from '2020-05-01', we need to call:

from aqua import Reader
reader = Reader(model="IFS", exp= "tco2559-ng5", source="ICMGG_atm2d",
                streaming=True, aggregation = '3D', startdate = '2020-05-01')
data = reader.retrieve()

The data available with the first retrieve will be only 3 days of the available times. The retrieve() method can then be called multiple times, returning a new chunk of 3 days of data, until all data are streamed. The function will automatically determine the appropriate start and end points for each chunk based on the internal state of the streaming process.

If we want to reset the state of the streaming process, we can call the reset_stream() method.

Accessors

AQUA also provides a special aqua accessor to Xarray which allows to call most functions and methods of the reader class as if they were methods of a DataArray or Dataset.

Basic usage

Reader methods like reader.regrid() or functions like plot_single_map() can now also be accessed by appending the suffix aqua to a DataArray or Dataset, followed by the function of interest, like in data.aqua.regrid().

This means that instead of writing:

reader.fldmean(reader.timmean(data.tcc, freq="Y"))

we can write:

data.tcc.aqua.timmean(freq="Y").aqua.fldmean()

Note

The accessor always assumes that the Reader instance to be used is either the one with which a Dataset was created or, for new derived objects and for DataArrays of a Datasets, the last instantiated Reader or the last use of the retrieve() method. This means that if more than one reader instance is used (for example to compare different datasets) we recommend not to use the accessor.

Usage with multiple Reader instances

As an alternative the Reader class contains a special set_default() method which sets that reader as an accessor default in the following. The accessor itself also has a set_default() method (accepting a reader instance as an argument) which sets the default and returns the same object.

Usage examples when multiple readers are used:

from aqua import Reader
reader1=Reader(model="IFS", exp="test-tco79", source="short", regrid="r100")  # the default is now reader1
reader2=Reader(model="IFS", exp="test-tco79", source="short", regrid="r200")  # the default is now reader2
data1 = reader1.retrieve()  # the default is now reader1
data2 = reader2.retrieve()  # the default is now reader2
reader1.set_default()  # the default is now reader1
data1r = data1.aqua.regrid()
data2r = data2.aqua.regrid()  # data2 was created by retrieve(), so it remembers its default reader
data2r = data2['2t'].aqua.set_default(reader2).aqua.regrid()  # the default is set to reader2 before using a method

Parallel Processing

Since most of the objects in AQUA are based on xarray, you can use parallel processing capabilities provided by xarray through integration with dask to speed up the execution of data processing tasks.

For example, if you are working with AQUA interactively in a Jupyter Notebook, you can start a dask cluster to parallelize your computations.

from dask.distributed import Client
import dask
dask.config.config.get('distributed').get('dashboard').update({'link':'{JUPYTERHUB_SERVICE_PREFIX}/proxy/{port}/status'})

client = Client(n_workers=40, threads_per_worker=1, memory_limit='5GB')
client

The above code will start a dask cluster with 40 workers and one thread per worker.

Data provenance

The implementation of data provenance is under development and only basic functionalities are available. A set of metadata is stored as AQUA_metadata_name in the attributes of the xarray.Dataset retrieved by the Reader. The metadata includes information about the model, experiment, source and AQUA version used to retrieve the data.

Additionally the history attribute is updated with details of the operations performed on the data. This is done by the Reader and it is under implementation for diagnostics as well.

A function log_history is available in the aqua.core.logger module to log in this attribute for a specific dataset. This can be used as well to log the history of the operations performed on the data in custom scripts.