DROP - Data Reduction OPerator

DROP (Data Reduction OPerator) is a comprehensive tool within the AQUA framework designed to extract, process, and organize data from any climate dataset.

What is DROP?

DROP combines the regridding, fixing, and time averaging capabilities included in AQUA into a single data reduction operator. The Drop class uses Dask for parallel computation, can process any supported dataset, and serves as a general-purpose data reduction platform.

DROP Capabilities

DROP’s architecture enables various data processing tasks:

Temporal Processing:

Custom frequency resampling (any frequency to any frequency)
Multiple statistics: mean, std, max, min, sum or histogram
Handling of incomplete time chunks
Support for kwargs to specify arguments of callable statistics (e.g. histogram bins and range)

Spatial Processing:

Regridding to any supported resolution or native grid
Regional data extraction with configurable boundaries
Support for both regular and irregular grids

Data Management:

Automatic catalog entry generation for DROP-generated outputs
Output in NetCDF, Zarr, and icechunk formats for flexible access
Parallel processing with configurable workers
Memory-efficient chunked processing

Example use cases:

Extract monthly European data from global daily archives
Convert model output from native grid to regular 0.25° grid
Create statistical summaries (std, max, min) instead of just means
Process specific ensemble members or realizations

DROP can be explored in the DROP notebook.

Note

DROP is designed to be flexible and can be used for a wide range of data reduction tasks beyond the specific use cases mentioned above. However, the processing window and output file are always based on monthly chunks.
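
To make the monthly chunking concrete, the sketch below splits a requested period into calendar-month windows. The helper `monthly_chunks` is hypothetical and not part of the AQUA API; it only illustrates the idea that each processing window, and each output file, covers one calendar month.

```python
import calendar
from datetime import date


def monthly_chunks(start: date, end: date):
    """Yield (first_day, last_day) for every calendar month touching [start, end].

    Hypothetical helper for illustration only; not part of AQUA.
    """
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        last_day = calendar.monthrange(year, month)[1]
        yield date(year, month, 1), date(year, month, last_day)
        # Move to the next month, rolling over the year after December
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


chunks = list(monthly_chunks(date(2020, 1, 15), date(2020, 3, 10)))
for first, last in chunks:
    print(first, last)
```

Note that the first and last windows extend beyond the requested dates: these are the incomplete chunks that the exclude_incomplete option, described later, is meant to handle.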

The Low Resolution Archive (LRA) Context

The Low Resolution Archive is a key use case for DROP. The LRA is an intermediate layer of data reduction that simplifies the analysis of extremely high-resolution data by providing monthly data at 1 degree resolution, which reduces storage and computational requirements.

Note

The LRAs built by the AQUA team and available on Levante and LUMI are all at r100 (i.e. 1 degree resolution) and at monthly frequency. The corresponding catalog entry name is lra-r100-monthly.

Source Naming Convention

DROP automatically generates catalog source names following a consistent pattern:

Standard naming format:

Pattern: {resolution}-{frequency}
Examples:
- r100-monthly (1° resolution, monthly frequency)
- r100-daily (1° resolution, daily frequency)
- r25-monthly (0.25° resolution, monthly frequency)

Default LRA source:

lra-r100-monthly

Zarr variants: All sources have corresponding Zarr reference versions with -zarr suffix:

r100-monthly-zarr
r100-daily-zarr

Resolution codes:

r100 = 1° (100km approximately)
r25 = 0.25° (25km approximately)
native = original model grid
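
The naming pattern above can be reproduced with a few lines of Python, which may help when building source names in scripts. The function `drop_source_name` is a hypothetical helper written for this page, not an AQUA function; it only encodes the {resolution}-{frequency} pattern and the -zarr suffix.

```python
def drop_source_name(resolution: str, frequency: str, zarr: bool = False) -> str:
    """Build a source name following the {resolution}-{frequency} pattern.

    Hypothetical helper for illustration; AQUA generates these names itself.
    """
    name = f"{resolution}-{frequency}"
    # Zarr variants carry the same name with a -zarr suffix
    return f"{name}-zarr" if zarr else name


print(drop_source_name("r100", "monthly"))
print(drop_source_name("r100", "daily", zarr=True))
```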

Parameter-based access: Different processing options can be accessed via Reader kwargs:

from aqua import Reader

# Access specific statistics (if generated)
reader = Reader(model="IFS-NEMO", exp="historical-1990",
                source="r100-monthly", stat="std")

# Access regional data (if generated)
reader = Reader(model="IFS-NEMO", exp="historical-1990",
                source="r100-monthly", region="europe")

# Access specific ensemble realizations
reader = Reader(model="IFS-NEMO", exp="historical-1990",
                source="r100-monthly", realization="r2")

Accessing DROP-generated data

Once DROP has processed the data, generated outputs can be accessed via the standard Reader interface using the automatically created catalog sources.

from aqua import Reader
reader = Reader(model="IFS-NEMO", exp="historical-1990", source="lra-r100-monthly")
data = reader.retrieve()

Advanced access patterns:

# Access standard deviation instead of mean
reader = Reader(model="ERA5", exp="era5", source="r100-monthly", stat="std")
std_data = reader.retrieve()

# Access regional European data
reader = Reader(model="IFS-NEMO", exp="historical-1990",
               source="r25-daily", region="europe")
eu_data = reader.retrieve()

# Access specific ensemble member
reader = Reader(model="IFS-NEMO", exp="historical-1990",
               source="r100-daily", realization="r3")
member_data = reader.retrieve()

Zarr access for faster performance:

When available, Zarr stores can be used for improved access performance:

# Faster access using Zarr references
reader = Reader(model="IFS-NEMO", exp="historical-1990", source="lra-r100-monthly-zarr")
data = reader.retrieve()

Note

The specific source names depend on the resolution and frequency you configured when running DROP. See the “Source Naming Convention” section above for details.

Using DROP to process data

DROP processes data through a command line interface (CLI) available with the subcommand aqua drop.

Configuration is done via a YAML file that can be built from the drop_config.tmpl, available in the .aqua/templates/drop folder after installation. The configuration file allows you to specify:

Target resolution and frequency
Variables to process
Regional boundaries (optional)
Output and temporary directories
SLURM options and number of workers

Configuration structure:

The configuration follows the model-exp-source 3-level hierarchy in the data dictionary. Key configuration options include:

vars: variables to process
resolution: target spatial resolution (e.g., r100, r25, native)
frequency: target temporal frequency (e.g., monthly, daily, 3hourly)
stat: statistic to compute (mean, std, max, min, sum, histogram)
region: spatial subsetting configuration
engine: The engine used for the GSV retrieval, options are ‘fdb’ and ‘polytope’.

Warning

Catalog detection is automatic, but specify the catalog name explicitly in the configuration file if you have identically named triplets in different catalogs.

Configuration File

The DROP configuration file is structured in YAML format with five main sections: target, paths, options, slurm, and data. Below is a detailed explanation of each configuration parameter.

Target Section

The target section defines the primary output characteristics for the DROP processing:

target:
  resolution: r100
  frequency: monthly
  catalog: my_catalog
  startdate: "2020-01-01T00:00:00"
  enddate: "2020-12-31T23:00:00"
  region:
    name: Europe
    lat: [35, 70]
    lon: [-10, 40]
  level: [850, 500]
  stat: mean
  stat_kwargs: {}
  regrid_first: False

resolution (string, required): Target spatial resolution for regridding.
- r100: 1° resolution (~100km) or any other supported target grid (see Available target grids)
- native: Keep original model grid (no regridding)
frequency (string, required): Target temporal frequency for output.
- monthly, daily, 3hourly, 6hourly, hourly
- Any valid frequency string supported by pandas resample
- If not specified, keeps original data frequency
catalog (string, optional): Name of the catalog to process.
- It will be used for all the models listed in the data section.
startdate (string, optional): Starting date for data processing.
- Format: YYYY-MM-DD or any valid date string parsable by pandas
- Example: "2020-01-01"
- If omitted, processes from the first available date
enddate (string, optional): Ending date for data processing.
- Format: YYYY-MM-DD or any valid date string parsable by pandas
- Example: "2020-12-31"
- If omitted, processes until the last available date
region (dict, optional): Spatial subsetting configuration. If omitted, processes global data.
- name (string): Region identifier (e.g., Europe, Tropics)
- lat (list): Latitude range as [min, max] (e.g., [35, 70])
- lon (list): Longitude range as [min, max] (e.g., [-10, 40])
- drop (bool, optional): Whether to drop missing values in the region selection. Default: False
level (int, float or list, optional): Vertical levels to select (e.g., pressure levels like [850, 500] or model-specific levels). Default: None
- If specified, only these levels will be processed. If omitted, all levels are included.
stat (string, optional): Statistical operator for temporal aggregation. Default: mean
- mean: Arithmetic mean
- std: Standard deviation
- max: Maximum value
- min: Minimum value
- sum: Sum of values
- histogram: Compute histogram (requires stat_kwargs to specify the range argument)
stat_kwargs (dict, optional): Additional arguments for the statistical function. Default: {}
- For histogram e.g.: {bins: 20, range: [0, 100]}
- Empty dict or missing line for other statistics that don’t require additional arguments
regrid_first (bool, optional): Whether to apply regridding (and region selection) before time statistics. Default: False
- For some statistics (e.g., histogram), it may be necessary to regrid the data before applying the statistic, because the statistic can disrupt the spatial dimensions required for regridding.
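
To illustrate what the bins and range entries of stat_kwargs control, here is a minimal pure-Python sketch of an equal-width histogram. It is not the AQUA implementation, which operates on Dask-backed arrays; the function `simple_histogram` is hypothetical, and the binning convention (half-open bins, with the upper edge included in the last bin, as in numpy.histogram) is an assumption made for this example.

```python
def simple_histogram(values, bins, range):
    """Count values into `bins` equal-width bins spanning `range` = [low, high].

    The parameter is named `range` (shadowing the builtin) only so that the
    stat_kwargs dictionary can be unpacked directly into the call.
    """
    low, high = range
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        if value < low or value > high:
            continue  # values outside the range are ignored
        # The upper edge of the range is counted in the last bin
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


stat_kwargs = {"bins": 20, "range": [0, 100]}
counts = simple_histogram([0, 4.9, 5, 50, 99.9, 100, 120], **stat_kwargs)
print(counts)
```

The result has one entry per bin rather than one value per grid point and time step, which changes the shape of the data.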

Paths Section

Defines the directory structure for outputs and temporary files:

paths:
  outdir: /path/to/output
  tmpdir: /path/to/tmp

outdir (string, required): Directory where final DROP outputs will be stored.
- Should have sufficient space for processed data
- Subdirectories are automatically created based on catalog/model/exp/source hierarchy
tmpdir (string, required): Directory for temporary files during processing.
- Must be on fast storage (ideally local to compute node)
- Should have space for intermediate monthly files and aggregated yearly files

Options Section

Controls processing behavior and performance settings:

options:
  engine: fdb
  loglevel: INFO
  driver: netcdf
  overwrite: False
  rebuild: False
  compact: cdo
  performance_reporting: False

engine (string, optional): Data retrieval engine. Default: fdb
- needed only for GSV retrieval, options are ‘fdb’ and ‘polytope’
- fdb: Fields DataBase, you should be on the same machine where the database is located
- polytope: Polytope service (remote access). Be sure to have the correct credentials and network access to use this option.
loglevel (string, optional): Logging verbosity. Default: WARNING
- Available levels: DEBUG, INFO, WARNING, ERROR
driver (string, optional): Format for the output files. Default: netcdf
- netcdf: Create NetCDF files. Monthly files are always created, but if compact is set to xarray or cdo (see below), they will be concatenated into yearly files and the monthly files will be deleted.
- zarr: Create Zarr datasets for faster subsequent access. Test feature under development, use with caution. Monthly files are created and then concatenated into consolidated yearly files, after which the monthly files are removed. This is suboptimal but provides safety against incomplete or corrupted files.
- icechunk: Write all data into a single git-like versioned Zarr repository using icechunk. Every month is committed as an atomic snapshot; failed writes are automatically rolled back to the last clean commit. A post-commit integrity check is performed after each month.
Warning

Experimental feature. icechunk output is experimental and not compatible with AQUA catalog integration. Running aqua drop with --driver icechunk (or driver: icechunk in the config file) will skip the automatic create_catalog_entry step, meaning the output cannot be accessed via Reader using a catalog source name. Direct access via icechunk.Repository.open and xr.open_zarr is required. Do not use in production pipelines until this limitation is resolved.
overwrite (bool, optional): Overwrite existing output files. Default: False
- True: Replace existing files
- False: Skip processing if files exist
- DROP checks if the existing files are complete before skipping, so it won’t skip if files are incomplete or corrupted
exclude_incomplete (bool, optional): Exclude incomplete temporal chunks. Default: False
- True: Drop months/periods with missing data
- False: Process all available data
rebuild (bool, optional): Force rebuilding of regridding weights. Default: False
- True: Regenerate area and weight files
- False: Use cached weights if available
- Set to True if you suspect weights are outdated (e.g., after a major update to CDO or AQUA)
compact (string, optional): Method for concatenating monthly files into yearly files. Only relevant when driver: netcdf. Default: cdo
- xarray: Use xarray for concatenation
- cdo: Use Climate Data Operators
- null or omit: No compacting, keep monthly files
performance_reporting (bool, optional): Generate Dask performance HTML report. Default: False
- True: Create detailed performance report for one chunk. Then the job will stop.
- False: No performance monitoring

SLURM Section

Configuration for HPC job submission (used by parallel DROP tools):

slurm:
  partition: standard
  username: myuser
  account: myproject
  time: "02:00:00"
  mem: "64GB"

partition (string): SLURM partition name (e.g., standard, compute, large-mem)
username (string): Your HPC username
account (string): Project or account name for billing
time (string): Maximum wall time (format: HH:MM:SS)
mem (string): Memory allocation per job (e.g., 64GB, 128GB)

Data Section

Defines the hierarchical structure of the data to process. All entries must belong to the catalog specified in the target section.

data:
  MODEL_NAME:
    EXPERIMENT_NAME:
      SOURCE_NAME:
        vars: ['var1', 'var2', 'var3']
        workers: 12
        realizations: [0, 1, 2]
        zoom: 8
        resolution: r25
        frequency: daily
        stat: std

The data section uses a three-level nested structure:

Model level: Top-level key for each model (e.g., ICON, IFS-NEMO)
Experiment level: Second-level key for each experiment (e.g., historical-1990)
Source level: Third-level key for each data source (e.g., hourly-hpz10-atm2d)

Each source configuration supports the following parameters:

vars (list, required): List of variable short names to process.
- Example: ['2t', 'tprate', 'msl']
workers (int, optional): Number of Dask workers for parallel processing. Default: 1
- Typical range: 4-16 depending on available memory and vertical levels
- 1 worker disables parallel processing
realizations (list, optional): Specific ensemble members to process.
- Example: [0, 1, 2] processes r0, r1, and r2
- If omitted, processes the default realization (r1)
- Only applicable to ensemble datasets
zoom (int, optional): Zoom level for HEALPix sources (e.g., zoom: 8). Passed directly to the Reader.
resolution (string, optional): Override target resolution for this specific source.
frequency (string, optional): Override target frequency for this specific source.
stat (string, optional): Override statistical operator for this specific source.
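
The three-level structure can be walked with a simple nested loop. The sketch below uses a plain Python dictionary that mirrors the data section once the YAML file is loaded; `iter_sources` is a hypothetical helper, not an AQUA function, and the mapping of realization indices to labels such as r0 follows the example given above.

```python
def iter_sources(data: dict):
    """Yield (model, exp, source, settings) for every source in a data section."""
    for model, experiments in data.items():
        for exp, sources in experiments.items():
            for source, settings in sources.items():
                yield model, exp, source, settings


# Dictionary equivalent of a small data section (illustrative values)
data = {
    "ICON": {
        "historical-1990": {
            "hourly-hpz10-atm2d": {"vars": ["2t", "msl"], "workers": 12},
        }
    },
    "IFS-NEMO": {
        "historical-1950": {
            "daily": {"vars": ["2t"], "realizations": [0, 1, 2]},
        }
    },
}

for model, exp, source, settings in iter_sources(data):
    # Realization indices map to labels such as r0, r1, r2
    members = [f"r{i}" for i in settings.get("realizations", [])]
    print(model, exp, source, settings["vars"], members)
```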

Example: Multiple Models and Configurations

data:
  ICON:
    historical-1990:
      hourly-hpz10-atm2d:
        vars: ['2t', 'tp', 'msl']
        workers: 12
        resolution: r100
        frequency: daily
        stat: mean

      daily-hpz10-oce2d:
        vars: ['avg_sithick', 'avg_siconc']
        workers: 16
        frequency: monthly

  IFS-NEMO:
    historical-1950:
      daily:
        vars: ['2t', 'tp']
        workers: 8
        stat: max
        region:
          name: Europe
          lat: [35, 70]
          lon: [-10, 40]

This configuration will process:

ICON historical-1990 atmospheric variables at daily frequency and r100 resolution
ICON historical-1990 ocean variables at monthly frequency
IFS-NEMO historical-1950 daily maximum values for European region

Configuration Precedence

When the same parameter appears at multiple levels, the precedence order is:

Command-line arguments (highest priority)
Source-level settings in the data section
Target-level settings in the target section (lowest priority)

This allows you to set global defaults in target and override them for specific sources or via command line.
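
The precedence rule can be expressed as a short lookup. The following is a sketch under stated assumptions rather than the actual DROP code: `resolve_option` is a hypothetical function, and options that were not provided at a given level are represented by a missing key or by None.

```python
def resolve_option(name, cli_args, source_cfg, target_cfg, default=None):
    """Return the value of `name` using CLI > source > target precedence."""
    for level in (cli_args, source_cfg, target_cfg):
        value = level.get(name)
        if value is not None:
            return value
    return default


target_cfg = {"resolution": "r100", "frequency": "monthly", "stat": "mean"}
source_cfg = {"frequency": "daily", "stat": "std"}
cli_args = {"stat": "max", "frequency": None}  # None: not given on the command line

resolved = {
    key: resolve_option(key, cli_args, source_cfg, target_cfg)
    for key in ("resolution", "frequency", "stat")
}
print(resolved)
```

Here resolution comes from target, frequency from the source-level override, and stat from the command line.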

Usage

aqua drop <options>

Options: these override the configuration file options.

Note

The configuration file (-c/--config) is optional. When omitted, DROP runs in CLI-only mode and requires --outdir, --model, --exp, --source, and --var to be provided on the command line. Parameters that accept complex structures (region, stat_kwargs, compact, exclude_incomplete) are available only through the configuration file.
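
The requirement for CLI-only mode can be summarised as a small check. This is an illustrative sketch, not the validation code used by aqua drop: `missing_cli_options` is a hypothetical function and the parsed arguments are represented as a plain dictionary.

```python
REQUIRED_WITHOUT_CONFIG = ("outdir", "model", "exp", "source", "var")


def missing_cli_options(args: dict) -> list:
    """List the options that must still be supplied when no config file is given."""
    if args.get("config"):
        return []  # with a configuration file, nothing else is mandatory here
    return [f"--{name}" for name in REQUIRED_WITHOUT_CONFIG if not args.get(name)]


print(missing_cli_options({"model": "IFS-NEMO", "exp": "historical-1990"}))
print(missing_cli_options({"config": "drop_config.yaml"}))
```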

-c CONFIG, --config CONFIG: Set up a specific configuration file

-d, --definitive: Run the code and produce the data (a dry-run will take place if this flag is missing)

-f, --fix: Set up the Reader fixing capabilities (default: True)

-w, --workers: Set up the number of dask workers (default: 1, i.e. dask disabled)

-l, --loglevel: Set up the logging level (default: WARNING).

-o, --overwrite: Overwrite existing data (default: False).

--monitoring: Run a single chunk and produce the Dask HTML performance report, then stop. Requires Dask to be enabled (more than one worker).

--catalog-entry {yes,no,only}

Controls catalog entry behaviour (default: yes):

yes: write data and create/update the catalog entry (default behaviour).
no: write data but skip catalog creation (useful when catalog management is handled separately or for icechunk-based runs).
only: skip data writing and only create/update the catalog entry (replaces the former --only-catalog flag).

--rebuild: This option will force the rebuilding of the areas and weights files for the regridding. If multiple variables or members are present in the configuration, this will be done only once.

--stat: Statistic to be computed (default: ‘mean’). Options: ‘mean’, ‘std’, ‘max’, ‘min’, ‘sum’, ‘histogram’.

--frequency: Frequency of the DROP output (default: as the original data)

--resolution: Resolution of the DROP output (default: as the original data)

--regrid_first: Whether to apply regridding (and region selection) before time statistics (default: False). For some statistics (e.g., histogram), it may be necessary to regrid the data before applying the statistic, because the statistic can disrupt the spatial dimensions required for regridding.

--realization: Which realization (e.g. ensemble member) to use for the DROP output (default: ‘r1’)

--startdate: Start date for the DROP output (default: as the original data). Accepted format: ‘YYYY-MM-DDT00:00:00’

--enddate: End date for the DROP output (default: as the original data). Accepted format: ‘YYYY-MM-DDT23:00:00’

--level: Vertical levels to select. Default: None Can be a single level (int or float) or a list of levels separated by commas (e.g. ‘1000,850,500’). If specified, only these levels will be processed. If omitted, all levels are included.

--engine: The engine used for the GSV retrieval, options are ‘fdb’ (default) and ‘polytope’.

--driver: Output format for DROP files: netcdf (default), zarr, or icechunk. icechunk is experimental and does not support catalog integration.

--outdir OUTDIR: Output directory. Overrides paths.outdir from the configuration file. Required when running without a configuration file.

--tmpdir TMPDIR: Temporary directory. Overrides paths.tmpdir from the configuration file. Falls back to --outdir when not specified.

--no-validate: Skip the pre-run integrity check on existing output files. Speeds up startup for large datasets.

--catalog CATALOG: Catalog to process. Use together with --model, --exp and --source to narrow the run to a specific triplet instead of looping over the full data section.

-m MODEL, --model MODEL: Model to process. Use together with --exp and --source.

-e EXP, --exp EXP: Experiment to process. Use together with --model and --source.

-s SOURCE, --source SOURCE: Source to process. Use together with --model and --exp.

-v VAR, --var VAR: Single variable to process. Use together with --source.

Examples:

Process data to create monthly 1° resolution output:

aqua drop -c drop_config.yaml -d -w 4

Generate daily data at 0.25° resolution with 8 workers:

aqua drop -c drop_config.yaml -d -w 8 --resolution r25 --frequency daily
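
When DROP is launched from a Python workflow, the same command lines can be assembled programmatically. The sketch below only builds and prints the command; `build_drop_command` is a hypothetical helper, and it uses only flags documented above (-c, -w, -d and the long override options).

```python
import shlex


def build_drop_command(config, workers=1, definitive=False, **overrides):
    """Assemble an `aqua drop` command line as a list of arguments."""
    cmd = ["aqua", "drop", "-c", config, "-w", str(workers)]
    if definitive:
        cmd.append("-d")  # without -d, DROP performs a dry run
    for option, value in overrides.items():
        cmd += [f"--{option}", str(value)]
    return cmd


cmd = build_drop_command("drop_config.yaml", workers=8, definitive=True,
                         resolution="r25", frequency="daily")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once the dry run has been checked.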

Warning

Keep in mind that this script is ideally submitted as a batch job to an HPC node; a SLURM template is available in the same directory (.aqua/templates/drop/drop-submitter.tmpl). Although the computation is split into months, the memory needed to load very large data is a limiting factor: unless you have a node with very large memory, it is unlikely that you can use more than 16 workers.

Output:

After processing, new catalog entries are automatically created following the naming convention described above, allowing immediate access to your processed data.

Parallel DROP tool

Running DROP can be memory-intensive and cannot be easily parallelized within a single job. For processing multiple variables or large datasets, use the parallel execution script cli_drop_parallel_slurm.py to submit multiple SLURM jobs simultaneously:

./cli_drop_parallel_slurm.py -c drop_config.yaml -d -w 4 -p 4

This processes data using 4 workers per node with up to 4 concurrent SLURM jobs. The script fills in a Jinja2 template of a typical SLURM script, aqua_drop.j2. For now it is configured to run only on LUMI, but further development should allow for broader portability.

A -s option is available to run via a container instead of the local installation.

Warning

Use with caution. This script rapidly submits tens of jobs to the SLURM scheduler!