DROP - Data Reduction OPerator
DROP (Data Reduction OPerator) is a comprehensive tool within the AQUA framework designed to extract, process, and organize data from any climate dataset.
What is DROP?
DROP is a comprehensive data reduction operator that combines the regridding, fixing, and time
averaging capabilities included in AQUA. The Drop class uses dask to exploit parallel
computations and can process any supported dataset, and it serves as a general-purpose data
reduction platform.
DROP Capabilities
DROP’s architecture enables various data processing tasks:
Temporal Processing:
Custom frequency resampling (any frequency to any frequency)
Multiple statistics: mean, std, max, min, sum or histogram
Handling of incomplete time chunks
Support for kwargs to specify arguments of callable statistics (e.g. histogram bins and range)
Spatial Processing:
Regridding to any supported resolution or native grid
Regional data extraction with configurable boundaries
Support for both regular and irregular grids
Data Management:
Automatic catalog entry generation for DROP-generated outputs
Output in NetCDF, Zarr, and icechunk formats for flexible access
Parallel processing with configurable workers
Memory-efficient chunked processing
Example use cases:
Extract daily European data from global monthly archives
Convert model output from native grid to regular 0.25° grid
Create statistical summaries (std, max, min) instead of just means
Process specific ensemble members or realizations
DROP can be explored in the DROP notebook.
Note
DROP is designed to be flexible and can be used for a wide range of data reduction tasks beyond the specific use cases mentioned above. However, the processing window and output file are always based on monthly chunks.
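To make the monthly chunking concrete, the sketch below splits a date range into calendar-month windows using only the standard library. It is an illustration under the assumption that chunks are aligned to calendar months; the function name monthly_chunks is hypothetical and not part of AQUA.

```python
from datetime import date, timedelta


def monthly_chunks(start: date, end: date):
    """Yield (first_day, last_day) for every calendar month touching [start, end]."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        first = date(year, month, 1)
        # The first day of the following month, minus one day, is the month end
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        last = date(next_year, next_month, 1) - timedelta(days=1)
        yield first, last
        year, month = next_year, next_month


chunks = list(monthly_chunks(date(2020, 11, 15), date(2021, 2, 10)))
for first, last in chunks:
    print(first, last)
```

A request that starts or ends mid-month still maps onto whole-month windows here, which is why partial months are relevant to the handling of incomplete time chunks.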
The Low Resolution Archive (LRA) Context
The Low Resolution Archive is a key use case for DROP. The LRA is an intermediate layer of data reduction that simplifies the analysis of extremely high-resolution data by providing monthly data at 1 degree resolution, which reduces storage and computational requirements.
Note
The LRAs made available on Levante and LUMI by the AQUA team are all at r100 (i.e. 1 deg
resolution) and at monthly frequency. The corresponding catalog entry name is
lra-r100-monthly.
Source Naming Convention
DROP automatically generates catalog source names following a consistent pattern:
Standard naming format:
Pattern: {resolution}-{frequency}
Examples:
r100-monthly (1° resolution, monthly frequency)
r100-daily (1° resolution, daily frequency)
r25-monthly (0.25° resolution, monthly frequency)
Default LRA source:
lra-r100-monthly
Zarr variants:
All sources have corresponding Zarr reference versions with a -zarr suffix:
r100-monthly-zarr
r100-daily-zarr
Resolution codes:
r100 = 1° (100km approximately)
r25 = 0.25° (25km approximately)
native = original model grid
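The naming pattern can be sketched as a small helper. This is only an illustration of the convention described above; drop_source_name is a hypothetical function, not an AQUA API.

```python
def drop_source_name(resolution: str, frequency: str, zarr: bool = False) -> str:
    """Compose a catalog source name as {resolution}-{frequency}[-zarr]."""
    name = f"{resolution}-{frequency}"
    # Zarr reference variants carry a -zarr suffix
    return f"{name}-zarr" if zarr else name


print(drop_source_name("r100", "monthly"))
print(drop_source_name("r25", "daily", zarr=True))
```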
Parameter-based access: Different processing options can be accessed via Reader kwargs:
from aqua import Reader

# Access specific statistics (if generated)
reader = Reader(model="IFS-NEMO", exp="historical-1990",
                source="r100-monthly", stat="std")
# Access regional data (if generated)
reader = Reader(model="IFS-NEMO", exp="historical-1990",
                source="r100-monthly", region="europe")
# Access specific ensemble realizations
reader = Reader(model="IFS-NEMO", exp="historical-1990",
                source="r100-monthly", realization="r2")
Accessing DROP-generated data
Once DROP has processed the data, generated outputs can be accessed via the standard Reader
interface using the automatically created catalog sources.
from aqua import Reader
reader = Reader(model="IFS-NEMO", exp="historical-1990", source="lra-r100-monthly")
data = reader.retrieve()
Advanced access patterns:
# Access standard deviation instead of mean
reader = Reader(model="ERA5", exp="era5", source="r100-monthly", stat="std")
std_data = reader.retrieve()
# Access regional European data
reader = Reader(model="IFS-NEMO", exp="historical-1990",
                source="r25-daily", region="europe")
eu_data = reader.retrieve()
# Access specific ensemble member
reader = Reader(model="IFS-NEMO", exp="historical-1990",
                source="r100-daily", realization="r3")
member_data = reader.retrieve()
Zarr access for faster performance:
When available, you can access the data through Zarr stores for improved performance:
# Faster access using Zarr references
reader = Reader(model="IFS-NEMO", exp="historical-1990", source="lra-r100-monthly-zarr")
data = reader.retrieve()
Note
The specific source names depend on the resolution and frequency you configured when running DROP. See the “Source Naming Convention” section above for details.
Using DROP to process data
DROP processes data through a command line interface (CLI) available with the subcommand aqua drop.
Configuration is done via a YAML file that can be built from the drop_config.tmpl,
available in the .aqua/templates/drop folder after installation. The configuration
file allows you to specify:
Target resolution and frequency
Variables to process
Regional boundaries (optional)
Output and temporary directories
SLURM options and number of workers
Configuration structure:
The configuration follows the model-exp-source 3-level hierarchy in the data dictionary.
Key configuration options include:
vars: variables to process
resolution: target spatial resolution (e.g., r100, r25, native)
frequency: target temporal frequency (e.g., monthly, daily, 3hourly)
stat: statistic to compute (mean, std, max, min)
region: spatial subsetting configuration
engine: the engine used for the GSV retrieval, options are ‘fdb’ and ‘polytope’
Warning
Catalog detection is automatic, but specify the catalog name explicitly in the configuration file if you have identically named triplets in different catalogs.
Configuration File
The DROP configuration file is structured in YAML format with five main sections: target,
paths, options, slurm, and data. Below is a detailed explanation of each
configuration parameter.
Target Section
The target section defines the primary output characteristics for the DROP processing:
target:
  resolution: r100
  frequency: monthly
  catalog: my_catalog
  startdate: "2020-01-01T00:00:00"
  enddate: "2020-12-31T23:00:00"
  region:
    name: Europe
    lat: [35, 70]
    lon: [-10, 40]
  stat: mean
  stat_kwargs: {}
resolution (string, required): Target spatial resolution for regridding.
r100: 1° resolution (~100km) or any other supported target grid (see Available target grids)
native: Keep original model grid (no regridding)
frequency (string, required): Target temporal frequency for output.
monthly, daily, 3hourly, 6hourly, hourly
Any valid frequency string supported by pandas resample
If not specified, keeps original data frequency
catalog (string, optional): Name of the catalog to process.
It will be used for all the models listed in the data section.
startdate (string, optional): Starting date for data processing.
Format: YYYY-MM-DD or any valid date string parsable by pandas
Example: "2020-01-01"
If omitted, processes from the first available date
enddate (string, optional): Ending date for data processing.
Format: YYYY-MM-DD or any valid date string parsable by pandas
Example: "2020-12-31"
If omitted, processes until the last available date
region (dict, optional): Spatial subsetting configuration. If omitted, processes global data.
name (string): Region identifier (e.g., Europe, Tropics)
lat (list): Latitude range as [min, max] (e.g., [35, 70])
lon (list): Longitude range as [min, max] (e.g., [-10, 40])
stat (string, optional): Statistical operator for temporal aggregation. Default: mean
mean: Arithmetic mean
std: Standard deviation
max: Maximum value
min: Minimum value
sum: Sum of values
histogram: Compute histogram (requires stat_kwargs to specify the range argument)
stat_kwargs (dict, optional): Additional arguments for the statistical function. Default: {}
For histogram e.g.: {bins: 20, range: [0, 100]}
Empty dict or missing line for other statistics that don’t require additional arguments
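The constraints listed for the target section can be checked before submitting a run. The following is a minimal sketch that works on a target section already loaded as a Python dict; check_target and ALLOWED_STATS are hypothetical names, and the checks only mirror the rules stated above rather than DROP's own validation.

```python
ALLOWED_STATS = {"mean", "std", "max", "min", "sum", "histogram"}


def check_target(target):
    """Return a list of human-readable problems found in a target section."""
    problems = []
    stat = target.get("stat", "mean")  # documented default
    stat_kwargs = target.get("stat_kwargs") or {}
    if stat not in ALLOWED_STATS:
        problems.append(f"unknown stat: {stat!r}")
    # The histogram statistic needs an explicit range argument
    if stat == "histogram" and "range" not in stat_kwargs:
        problems.append("histogram requires 'range' in stat_kwargs")
    region = target.get("region")
    if region is not None:
        for key in ("lat", "lon"):
            bounds = region.get(key)
            if bounds is None or len(bounds) != 2:
                problems.append(f"region.{key} must be [min, max]")
    return problems


good = {
    "resolution": "r100",
    "frequency": "monthly",
    "stat": "histogram",
    "stat_kwargs": {"bins": 20, "range": [0, 100]},
    "region": {"name": "Europe", "lat": [35, 70], "lon": [-10, 40]},
}
bad = {"resolution": "r100", "frequency": "monthly", "stat": "histogram"}

print(check_target(good))
print(check_target(bad))
```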
Paths Section
Defines the directory structure for outputs and temporary files:
paths:
  outdir: /path/to/output
  tmpdir: /path/to/tmp
outdir (string, required): Directory where final DROP outputs will be stored.
Should have sufficient space for processed data
Subdirectories are automatically created based on catalog/model/exp/source hierarchy
tmpdir (string, required): Directory for temporary files during processing.
Must be on fast storage (ideally local to compute node)
Should have space for intermediate monthly files and aggregated yearly files
Options Section
Controls processing behavior and performance settings:
options:
  engine: fdb
  loglevel: INFO
  driver: netcdf
  overwrite: False
  rebuild: False
  compact: cdo
  performance_reporting: False
engine (string, optional): Data retrieval engine, needed only for GSV retrieval. Options are ‘fdb’ and ‘polytope’. Default: fdb
fdb: Fields DataBase, you should be on the same machine where the database is located
polytope: Polytope service (remote access). Be sure to have the correct credentials and network access to use this option.
loglevel (string, optional): Logging verbosity. Default: WARNING
Available levels: DEBUG, INFO, WARNING, ERROR
driver (string, optional): Format for the output files. Default: netcdf
netcdf: Create NetCDF files. Monthly files are always created, but if compact is set to xarray or cdo (see below), they will be concatenated into yearly files and the monthly files will be deleted.
zarr: Create Zarr datasets for faster subsequent access. Test feature under development, use with caution. Monthly files are created and then concatenated into consolidated yearly files, and the monthly files are removed. This is suboptimal but provides safety against incomplete or corrupted files.
icechunk: Write all data into a single git-like versioned Zarr repository using icechunk. Every month is committed as an atomic snapshot; failed writes are automatically rolled back to the last clean commit. A post-commit integrity check is performed after each month.
Warning
Experimental feature. icechunk output is experimental and not compatible with AQUA catalog integration. Running aqua drop with --driver icechunk (or driver: icechunk in the config file) will skip the automatic create_catalog_entry step, meaning the output cannot be accessed via Reader using a catalog source name. Direct access via icechunk.Repository.open and xr.open_zarr is required. Do not use in production pipelines until this limitation is resolved.
overwrite (bool, optional): Overwrite existing output files. Default: False
True: Replace existing files
False: Skip processing if files exist
DROP checks if the existing files are complete before skipping, so it won’t skip if files are incomplete or corrupted
exclude_incomplete (bool, optional): Exclude incomplete temporal chunks. Default: False
True: Drop months/periods with missing data
False: Process all available data
rebuild (bool, optional): Force rebuilding of regridding weights. Default: False
True: Regenerate area and weight files
False: Use cached weights if available
Set to True if you suspect weights are outdated (e.g., after a major update to CDO or AQUA)
compact (string, optional): Method for concatenating monthly files into yearly files. Only relevant when driver: netcdf. Default: cdo
xarray: Use xarray for concatenation
cdo: Use Climate Data Operators
null or omit: No compacting, keep monthly files
performance_reporting (bool, optional): Generate Dask performance HTML report. Default: False
True: Create detailed performance report for one chunk. Then the job will stop.
False: No performance monitoring
SLURM Section
Configuration for HPC job submission (used by parallel DROP tools):
slurm:
  partition: standard
  username: myuser
  account: myproject
  time: "02:00:00"
  mem: "64GB"
partition (string): SLURM partition name (e.g., standard, compute, large-mem)
username (string): Your HPC username
account (string): Project or account name for billing
time (string): Maximum wall time (format: HH:MM:SS)
mem (string): Memory allocation per job (e.g., 64GB, 128GB)
Data Section
Defines the hierarchical structure of data to process. All entries must belong to the same catalog, the one specified in the target section.
data:
  MODEL_NAME:
    EXPERIMENT_NAME:
      SOURCE_NAME:
        vars: ['var1', 'var2', 'var3']
        workers: 12
        realizations: [0, 1, 2]
        zoom: 8
        resolution: r25
        frequency: daily
        stat: std
The data section uses a three-level nested structure:
Model level: Top-level key for each model (e.g., ICON, IFS-NEMO)
Experiment level: Second-level key for each experiment (e.g., historical-1990)
Source level: Third-level key for each data source (e.g., hourly-hpz10-atm2d)
Each source configuration supports the following parameters:
vars (list, required): List of variable short names to process.
Example:
['2t', 'tprate', 'msl']
workers (int, optional): Number of Dask workers for parallel processing. Default: 1
Typical range: 4-16 depending on available memory and vertical levels
1 worker disables parallel processing
realizations (list, optional): Specific ensemble members to process.
Example: [0, 1, 2] processes r0, r1, and r2
If omitted, processes the default realization (r1)
Only applicable to ensemble datasets
zoom (int, optional): Zoom level for HEALPix sources (e.g., zoom: 8). Passed directly to the Reader.
resolution (string, optional): Override target resolution for this specific source.
frequency (string, optional): Override target frequency for this specific source.
stat (string, optional): Override statistical operator for this specific source.
Example: Multiple Models and Configurations
data:
  ICON:
    historical-1990:
      hourly-hpz10-atm2d:
        vars: ['2t', 'tp', 'msl']
        workers: 12
        resolution: r100
        frequency: daily
        stat: mean
      daily-hpz10-oce2d:
        vars: ['avg_sithick', 'avg_siconc']
        workers: 16
        frequency: monthly
  IFS-NEMO:
    historical-1950:
      daily:
        vars: ['2t', 'tp']
        workers: 8
        stat: max
        region:
          name: Europe
          lat: [35, 70]
          lon: [-10, 40]
This configuration will process:
ICON historical-1990 atmospheric variables at daily/r100 resolution
ICON historical-1990 ocean variables at monthly frequency
IFS-NEMO historical-1950 daily maximum values for European region
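The three-level nesting can be walked with a simple generator. The sketch below mirrors the example above as a plain Python dict (as it would look after loading the YAML) and lists the model/exp/source triplets; iter_sources is a hypothetical helper, not part of AQUA.

```python
# The data section from the example above, expressed as a Python dict
data = {
    "ICON": {
        "historical-1990": {
            "hourly-hpz10-atm2d": {"vars": ["2t", "tp", "msl"], "workers": 12,
                                   "resolution": "r100", "frequency": "daily",
                                   "stat": "mean"},
            "daily-hpz10-oce2d": {"vars": ["avg_sithick", "avg_siconc"],
                                  "workers": 16, "frequency": "monthly"},
        },
    },
    "IFS-NEMO": {
        "historical-1950": {
            "daily": {"vars": ["2t", "tp"], "workers": 8, "stat": "max"},
        },
    },
}


def iter_sources(section):
    """Yield (model, exp, source, settings) for every source in the data section."""
    for model, experiments in section.items():
        for exp, sources in experiments.items():
            for source, settings in sources.items():
                yield model, exp, source, settings


triplets = [(m, e, s) for m, e, s, _ in iter_sources(data)]
for model, exp, source, settings in iter_sources(data):
    print(model, exp, source, settings["vars"])
```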
Configuration Precedence
When the same parameter appears at multiple levels, the precedence order is:
Command-line arguments (highest priority)
Source-level settings in the data section
Target-level settings in the target section (lowest priority)
This allows you to set global defaults in target and override them for specific
sources or via command line.
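The precedence order can be illustrated with collections.ChainMap, which looks keys up in the first mapping that defines them. This is a sketch of the documented order only, not DROP's actual merging code; resolve_settings is a hypothetical name, and treating None as "option not given" is an assumption.

```python
from collections import ChainMap


def resolve_settings(cli, source, target):
    """Merge settings with precedence: command line > source level > target level."""
    # Drop unset (None) command-line options so they do not mask lower levels
    cli_set = {key: value for key, value in cli.items() if value is not None}
    return dict(ChainMap(cli_set, source, target))


target = {"resolution": "r100", "frequency": "monthly", "stat": "mean"}
source = {"frequency": "daily", "stat": "std"}
cli = {"stat": "max", "resolution": None}

resolved = resolve_settings(cli, source, target)
print(resolved)
```

Here the resolution falls through to the target default, the frequency comes from the source-level override, and the statistic comes from the command line.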
Usage
aqua drop <options>
Options: these override the configuration file options.
Note
The configuration file (-c/--config) is optional. When omitted, DROP runs in CLI-only mode
and requires --outdir, --model, --exp, --source, and --var to be provided
on the command line. Parameters that accept complex structures (region, stat_kwargs,
compact, exclude_incomplete) are available only through the configuration file.
- -c CONFIG, --config CONFIG
Set up a specific configuration file
- -d, --definitive
Run the code and produce the data (a dry-run will take place if this flag is missing)
- -f, --fix
Set up the Reader fixing capabilities (default: True)
- -w, --workers
Set up the number of dask workers (default: 1, i.e. dask disabled)
- -l, --loglevel
Set up the logging level (default: WARNING).
- -o, --overwrite
Overwrite existing data (default: False).
- --monitoring
Enable a single chunk run to produce the html dask performance report. Dask should be activated.
- --catalog-entry {yes,no,only}
Controls catalog entry behaviour (default: yes):
yes: write data and create/update the catalog entry (default behaviour).
no: write data but skip catalog creation (useful when catalog management is handled separately or for icechunk-based runs).
only: skip data writing and only create/update the catalog entry (replaces the former --only-catalog flag).
- --rebuild
This option will force the rebuilding of the areas and weights files for the regridding. If multiple variables or members are present in the configuration, this will be done only once.
- --stat
Statistic to be computed (default: ‘mean’). Options: ‘mean’, ‘std’, ‘max’, ‘min’, ‘sum’, ‘histogram’.
- --frequency
Frequency of the DROP output (default: as the original data)
- --resolution
Resolution of the DROP output (default: as the original data)
- --realization
Which realization (e.g. ensemble member) to use for the DROP output (default: ‘r1’)
- --startdate
Start date for the DROP output (default: as the original data). Accepted format: ‘YYYY-MM-DDT00:00:00’
- --enddate
End date for the DROP output (default: as the original data). Accepted format: ‘YYYY-MM-DDT23:00:00’
- --engine
The engine used for the GSV retrieval, options are ‘fdb’ (default) and ‘polytope’.
- --driver
Output format for DROP files: netcdf (default), zarr, or icechunk. icechunk is experimental and does not support catalog integration.
- --outdir OUTDIR
Output directory. Overrides paths.outdir from the configuration file. Required when running without a configuration file.
- --tmpdir TMPDIR
Temporary directory. Overrides paths.tmpdir from the configuration file. Falls back to --outdir when not specified.
- --no-validate
Skip the pre-run integrity check on existing output files. Speeds up startup for large datasets.
- --catalog CATALOG
Catalog to process. Use together with --model, --exp and --source to narrow the run to a specific triplet instead of looping over the full data section.
- -m MODEL, --model MODEL
Model to process. Use together with --exp and --source.
- -e EXP, --exp EXP
Experiment to process. Use together with --model and --source.
- -s SOURCE, --source SOURCE
Source to process. Use together with --model and --exp.
- -v VAR, --var VAR
Single variable to process. Use together with --source.
Examples:
Process data to create monthly 1° resolution output:
aqua drop -c drop_config.yaml -d -w 4
Generate daily data at 0.25° resolution with 8 workers:
aqua drop -c drop_config.yaml -d -w 8 --resolution r25 --frequency daily
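When DROP is launched from a Python workflow, the command line can be assembled programmatically. The sketch below only builds the argument list from flags documented above (-c, -w, -d, --resolution, --frequency); build_drop_command is a hypothetical helper, and leaving out -d by default reflects the dry-run behaviour described for that flag.

```python
import shlex


def build_drop_command(config, workers=1, definitive=False,
                       resolution=None, frequency=None):
    """Assemble an `aqua drop` argument list; without -d the run is a dry-run."""
    cmd = ["aqua", "drop", "-c", config, "-w", str(workers)]
    if definitive:
        cmd.append("-d")
    if resolution:
        cmd += ["--resolution", resolution]
    if frequency:
        cmd += ["--frequency", frequency]
    return cmd


cmd = build_drop_command("drop_config.yaml", workers=8, definitive=True,
                         resolution="r25", frequency="daily")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where AQUA is installed.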
Warning
Keep in mind that this script is ideally submitted as a batch job to an HPC node;
a SLURM template is available in the same directory (.aqua/templates/drop/drop-submitter.tmpl).
Be aware that although the computation is split among different months, the memory needed to load very large data
is a limiting factor: unless you have a node with very large memory, it is unlikely that you can use more than 16 workers.
Output:
After processing, new catalog entries are automatically created following the naming convention described above, allowing immediate access to your processed data.
Parallel DROP tool
Using DROP can be a memory-intensive task that cannot be easily parallelized within a single job.
For processing multiple variables or large datasets, use the parallel execution script
cli_drop_parallel_slurm.py to submit multiple SLURM jobs simultaneously:
./cli_drop_parallel_slurm.py -c drop_config.yaml -d -w 4 -p 4
This processes data using 4 workers per node with up to 4 concurrent SLURM jobs. The script fills in a Jinja2 template of a typical SLURM script, aqua_drop.j2. For now it is configured to run only on LUMI, but further development should allow for greater portability.
The -s option runs the job via a container instead of the local installation.
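The template-replacement idea can be sketched with the standard library. The example uses string.Template as a stand-in for Jinja2, and the template text is hypothetical: it is not the content of aqua_drop.j2, only an illustration of how the values from the slurm section could be substituted into a job script.

```python
from string import Template

# Hypothetical job script template; placeholders match the slurm section keys
JOB_TEMPLATE = Template("""#!/bin/bash
#SBATCH --partition=$partition
#SBATCH --account=$account
#SBATCH --time=$time
#SBATCH --mem=$mem

aqua drop -c $config -d -w $workers
""")

slurm = {
    "partition": "standard",
    "account": "myproject",
    "time": "02:00:00",
    "mem": "64GB",
}

# Values are inserted verbatim; no validation of SLURM syntax is done here
script = JOB_TEMPLATE.substitute(slurm, config="drop_config.yaml", workers=4)
print(script)
```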
Warning
Use with caution. This script rapidly submits tens of jobs to the SLURM scheduler!