DROP - Data Reduction OPerator

DROP (Data Reduction OPerator) is a comprehensive tool within the AQUA framework designed to extract, process, and organize data from any climate dataset.

Warning

Starting from AQUA v0.18, DROP is replacing Lor Resolution Archive (LRA) in most of the terminology, since DROP is a more generic tool that deals with many more options other than just LRA. However, LRA is still a key use case for DROP (see below).

What is DROP?

DROP is a comprehensive data reduction operator that combines the regridding, fixing, and time averaging capabilities included in AQUA. The Drop class uses dask to exploit parallel computations and can process any supported dataset, and it serves as a general-purpose data reduction platform.

DROP Capabilities

DROP’s architecture enables various data processing tasks:

Temporal Processing:

  • Custom frequency resampling (any frequency to any frequency)

  • Multiple statistics: mean, std, max, min

  • Handling of incomplete time chunks

Spatial Processing:

  • Regridding to any supported resolution or native grid

  • Regional data extraction with configurable boundaries

  • Support for both regular and irregular grids

Data Management:

  • Automatic catalog entry generation for DROP-generated outputs

  • Zarr reference creation for faster access

  • Parallel processing with configurable workers

  • Memory-efficient chunked processing

Example use cases:

  • Extract daily European data from global monthly archives

  • Convert model output from native grid to regular 0.25° grid

  • Create statistical summaries (std, max, min) instead of just means

  • Process specific ensemble members or realizations

DROP can be explored in the DROP notebook.

The Low Resolution Archive (LRA) Context

The Low Resolution Archive is a key use case for DROP. The LRA is an intermediate layer of data reduction that simplifies analysis of extreme high-resolution data by providing monthly data 1 degree resolution, permitting reduced storage and computational requirements.

Note

LRA built available on Levante and Lumi by AQUA team are all at r100 (i.e. 1 deg resolution) and at monthly frequency. The corresponding catalog entry name is lra-r100-monthly.

Source Naming Convention

DROP automatically generates catalog source names following a consistent pattern:

Standard naming format:

  • Pattern: {resolution}-{frequency}

  • Examples:

    • r100-monthly (1° resolution, monthly frequency)

    • r100-daily (1° resolution, daily frequency)

    • r25-monthly (0.25° resolution, monthly frequency)

Default LRA source:

  • lra-r100-monthly

Zarr variants: All sources have corresponding Zarr reference versions with -zarr suffix:

  • r100-monthly-zarr

  • r100-daily-zarr

Resolution codes:

  • r100 = 1° (100km approximately)

  • r25 = 0.25° (25km approximately)

  • native = original model grid

Parameter-based access: Different processing options can be accessed via Reader kwargs:

# Access specific statistics (if generated)
reader = Reader(model="IFS-NEMO", exp="historical-1990",
               source="r100-monthly", stat="std")

# Access regional data (if generated)
reader = Reader(model="IFS-NEMO", exp="historical-1990",
               source="r100-monthly", region="europe")

# Access specific ensemble realizations
reader = Reader(model="IFS-NEMO", exp="historical-1990",
               source="r100-monthly", realization="r2")

Accessing DROP-generated data

Once DROP has processed the data, generated outputs can be accessed via the standard Reader interface using the automatically created catalog sources.

from aqua import Reader
reader = Reader(model="IFS-NEMO", exp="historical-1990", source="lra-r100-monthly")
data = reader.retrieve()

Advanced access patterns:

# Access standard deviation instead of mean
reader = Reader(model="ERA5", exp="era5", source="r100-monthly", stat="std")
std_data = reader.retrieve()

# Access regional European data
reader = Reader(model="IFS-NEMO", exp="historical-1990",
               source="r25-daily", region="europe")
eu_data = reader.retrieve()

# Access specific ensemble member
reader = Reader(model="IFS-NEMO", exp="historical-1990",
               source="r100-daily", realization="r3")
member_data = reader.retrieve()

Zarr access for faster performance:

You can access data using Zarr reference files for improved performance, when available:

# Faster access using Zarr references
reader = Reader(model="IFS-NEMO", exp="historical-1990", source="r100-monthly-zarr")
data = reader.retrieve()

Note

The specific source names depend on the resolution and frequency you configured when running DROP. See the “Source Naming Convention” section above for details.

Warning

Zarr reference access is experimental and may not work with all experiment configurations.

Using DROP to process data

DROP processes data through a command line interface (CLI) available with the subcommand aqua drop.

Configuration is done via a YAML file that can be built from the drop_config.tmpl, available in the .aqua/templates/drop folder after installation. The configuration file allows you to specify:

  • Target resolution and frequency

  • Variables to process

  • Regional boundaries (optional)

  • Output and temporary directories

  • SLURM options and number of workers

Configuration structure:

The configuration follows the model-exp-source 3-level hierarchy in the data dictionary. Key configuration options include:

  • vars: variables to process

  • resolution: target spatial resolution (e.g., r100, r25, native)

  • frequency: target temporal frequency (e.g., monthly, daily, 3hourly)

  • stat: statistic to compute (mean, std, max, min)

  • region: spatial subsetting configuration

Warning

Catalog detection is automatic, but specify the catalog name explicitly in the configuration file if you have identically named triplets in different catalogs.

Usage

aqua drop <options>

Options: these override the configuration file options.

-c CONFIG, --config CONFIG

Set up a specific configuration file

-d, --definitive

Run the code and produce the data (a dry-run will take place if this flag is missing)

-f, --fix

Set up the Reader fixing capabilities (default: True)

-w, --workers

Set up the number of dask workers (default: 1, i.e. dask disabled)

-l, --loglevel

Set up the logging level.

-o, --overwrite

Overwrite existing data (default: WARNING).

--monitoring

Enable a single chunk run to produce the html dask performance report. Dask should be activated.

--only-catalog

Will generate/update only the catalog entry for DROP, without running the code for generating DROP output itself

--rebuild

This option will force the rebuilding of the areas and weights files for the regridding. If multiple variables or members are present in the configuration, this will be done only once.

--stat

Statistic to be computed (default: ‘mean’)

--frequency

Frequency of the DROP output (default: as the original data)

--resolution

Resolution of the DROP output (default: as the original data)

--realization

Which realization (e.g. ensemble member) to use for the DROP output (default: ‘r1’)

Examples:

Process data to create monthly 1° resolution output:

aqua drop -c drop_config.yaml -d -w 4

Generate daily data at 0.25° resolution with 8 workers:

aqua drop -c drop_config.yaml -d -w 8 --resolution r25 --frequency daily

Warning

Keep in mind that this script is ideally submitted via batch to a HPC node, so that a template for SLURM is also available in the same directory (.aqua/templates/drop/drop-submitter.tmpl). Be aware that although the computation is split among different months, the memory consumption of loading very big data is a limiting factor, so that unless you have very fat node it is unlikely you can use more than 16 workers.

Output:

After processing, new catalog entries are automatically created following the naming convention described above, allowing immediate access to your processed data.

Parallel DROP tool

Using DROP can be a memory-intensive task, that cannot be easily parallelized within a single job. For processing multiple variables or large datasets, use the parallel execution script cli_drop_parallel_slurm.py to submit multiple SLURM jobs simultaneously:

./cli_drop_parallel_slurm.py -c drop_config.yaml -d -w 4 -p 4

This processes data using 4 workers per node with up to 4 concurrent SLURM jobs. It builds on Jinja2 template replacement from a typical SLURM script aqua_drop.j2. For now it is configured only to be run on LUMI but further development should allow for larger portability.

A -s option to call the run via container instead of using the local installation.

Warning

Use with caution. This script rapidly submits tens of jobs to the SLURM scheduler!