Skip to content

Auxiliary Data and Datacubes

DARTS uses several auxiliary data - data which does not change between different scenes and / or time steps. Raster auxiliary data is stored in Zarr Datacubes.

Currently, the following auxiliary data is used:

  • ArcticDEM
  • Tasseled Cap indices (Brightness, Greenness, Wetness)

with more to come.

ArcticDEM

The ArcticDEM is downloaded via their STAC server using these extend files.

The user can specify the download directory, where the ArcticDEM will be procedurally stored in a Zarr Datacube. The user can also specify the resolution of the ArcticDEM, which is either 2m, 10m or 32m. Each resolution is stored in their own Zarr Datacube.

darts_acquisition.load_arcticdem(geobox, data_dir, resolution, buffer=0, persist=True)

Load the ArcticDEM for the given geobox, fetch new data from the STAC server if necessary.

Parameters:

Name Type Description Default
geobox GeoBox

The geobox for which the tile should be loaded.

required
data_dir Path | str

The directory where the ArcticDEM data is stored.

required
resolution Literal[2, 10, 32]

The resolution of the ArcticDEM data in m.

required
buffer int

The buffer around the projected (epsg:3413) geobox in pixels. Defaults to 0.

0
persist bool

If the data should be persisted in memory. If not, this will return a Dask backed Dataset. Defaults to True.

True

Returns:

Type Description
Dataset

xr.Dataset: The ArcticDEM tile, with a buffer applied. Note: The buffer is applied in the arcticdem dataset's CRS, hence the orientation might be different. Final dataset is NOT matched to the reference CRS and resolution.

Warning

Geobox must be in a meter based CRS.

Usage

Since the API of the load_arcticdem is based on GeoBox, one can load a specific ROI based on an existing Xarray DataArray:

import xarray as xr
import odc.geo.xr

from darts_aquisition import load_arcticdem

# Assume "optical" is an already loaded s2 based dataarray

arcticdem = load_arcticdem(
    optical.odc.geobox,
    "/path/to/arcticdem-parent-directory",
    resolution=2,
    buffer=ceil(self.tpi_outer_radius / 2 * sqrt(2))
)

# Now we can for example match the resolution and extent of the optical data:
arcticdem = arcticdem.odc.reproject(optical.odc.geobox, resampling="cubic")

The buffer parameter is used to extend the region of interest by a certain amount of pixels. This comes handy when calculating e.g. the Topographic Position Index (TPI), which requires a buffer around the region of interest to remove edge effects.

Source code in darts-acquisition/src/darts_acquisition/arcticdem/datacube.py
def load_arcticdem(
    geobox: GeoBox,
    data_dir: Path | str,
    resolution: RESOLUTIONS,
    buffer: int = 0,
    persist: bool = True,
) -> xr.Dataset:
    """Load the ArcticDEM for the given geobox, fetch new data from the STAC server if necessary.

    Args:
        geobox (GeoBox): The geobox for which the tile should be loaded.
        data_dir (Path | str): The directory where the ArcticDEM data is stored.
        resolution (Literal[2, 10, 32]): The resolution of the ArcticDEM data in m.
        buffer (int, optional): The buffer around the projected (epsg:3413) geobox in pixels. Defaults to 0.
        persist (bool, optional): If the data should be persisted in memory.
            If not, this will return a Dask backed Dataset. Defaults to True.

    Returns:
        xr.Dataset: The ArcticDEM tile, with a buffer applied.
            Note: The buffer is applied in the arcticdem dataset's CRS, hence the orientation might be different.
            Final dataset is NOT matched to the reference CRS and resolution.

    Warning:
        Geobox must be in a meter based CRS.

    Usage:
        Since the API of the `load_arcticdem` is based on GeoBox, one can load a specific ROI based on an existing Xarray DataArray:

        ```python
        import xarray as xr
        import odc.geo.xr

        from darts_aquisition import load_arcticdem

        # Assume "optical" is an already loaded s2 based dataarray

        arcticdem = load_arcticdem(
            optical.odc.geobox,
            "/path/to/arcticdem-parent-directory",
            resolution=2,
            buffer=ceil(self.tpi_outer_radius / 2 * sqrt(2))
        )

        # Now we can for example match the resolution and extent of the optical data:
        arcticdem = arcticdem.odc.reproject(optical.odc.geobox, resampling="cubic")
        ```

        The `buffer` parameter is used to extend the region of interest by a certain amount of pixels.
        This comes handy when calculating e.g. the Topographic Position Index (TPI), which requires a buffer around the region of interest to remove edge effects.

    """  # noqa: E501
    tick_fstart = time.perf_counter()

    data_dir = Path(data_dir) if isinstance(data_dir, str) else data_dir

    datacube_fpath = data_dir / f"datacube_{resolution}m_v4.1.zarr"
    storage = zarr.storage.FSStore(datacube_fpath)
    logger.debug(f"Getting ArcticDEM tile from {datacube_fpath.resolve()}")

    # ! The geobox must be in a meter based CRS
    logger.debug(f"Found a reference resolution of {geobox.resolution.x}m")

    # Check if the zarr data already exists
    if not datacube_fpath.exists():
        logger.debug(f"Creating a new zarr datacube at {datacube_fpath.resolve()} with {storage=}")
        create_empty_datacube(
            "ArcticDEM Data Cube",
            storage,
            DATA_EXTENT[resolution],
            CHUNK_SIZE,
            DATA_VARS,
            DATA_VARS_META,
            DATA_VARS_ENCODING,
        )

    # Get the adjacent arcticdem tiles
    # Note: We could also use pystac here, but this would result in a slight performance decrease
    # because of the network overhead, hence we use the extent file
    # Download the extent, download if the file does not exist
    extent_fpath = data_dir / f"ArcticDEM_Mosaic_Index_v4_1_{resolution}m.parquet"
    with download_lock:
        if not extent_fpath.exists():
            download_arcticdem_extent(data_dir)
    extent = gpd.read_parquet(extent_fpath)

    # Add a buffer around the geobox to get the adjacent tiles
    reference_geobox = geobox.to_crs("epsg:3413", resolution=resolution).pad(buffer)
    adjacent_tiles = extent[extent.intersects(reference_geobox.extent.geom)]

    # Download the adjacent tiles (if necessary)
    with download_lock:
        procedural_download_datacube(storage, adjacent_tiles)

    # Load the datacube and set the spatial_ref since it is set as a coordinate within the zarr format
    chunks = None if persist else "auto"
    arcticdem_datacube = xr.open_zarr(storage, mask_and_scale=False, chunks=chunks).set_coords("spatial_ref")

    # Get an AOI slice of the datacube
    arcticdem_aoi = arcticdem_datacube.odc.crop(reference_geobox.extent, apply_mask=False)

    # The following code would load the lazy zarr data from disk into memory
    if persist:
        tick_sload = time.perf_counter()
        arcticdem_aoi = arcticdem_aoi.load()
        tick_eload = time.perf_counter()
        logger.debug(f"ArcticDEM AOI loaded from disk in {tick_eload - tick_sload:.2f} seconds")

    # Change dtype of the datamask to uint8 for later reproject_match
    arcticdem_aoi["datamask"] = arcticdem_aoi.datamask.astype("uint8")

    logger.info(
        f"ArcticDEM tile {'loaded' if persist else 'lazy-opened'} in {time.perf_counter() - tick_fstart:.2f} seconds"
    )
    return arcticdem_aoi

Tasseled Cap indices (TCVIS)

The TCVIS data is downloaded from Google Earth-Engine (GEE) using the TCVIS collection from Ingmar Nitze: "users/ingmarnitze/TCTrend_SR_2000-2019_TCVIS".

darts_acquisition.load_tcvis(geobox, data_dir, buffer=0, persist=True)

Load the TCVIS for the given geobox, fetch new data from GEE if necessary.

Parameters:

Name Type Description Default
geobox GeoBox

The geobox to load the data for.

required
data_dir Path | str

The directory to store the downloaded data for faster access for consecutive calls.

required
buffer int

The buffer around the geobox in pixels. Defaults to 0.

0
persist bool

If the data should be persisted in memory. If not, this will return a Dask backed Dataset. Defaults to True.

True

Returns:

Type Description
Dataset

xr.Dataset: The TCVIS dataset.

Usage

Since the API of the load_tcvis is based on GeoBox, one can load a specific ROI based on an existing Xarray DataArray:

import xarray as xr
import odc.geo.xr

from darts_aquisition import load_tcvis

# Assume "optical" is an already loaded s2 based dataarray

tcvis = load_tcvis(
    optical.odc.geobox,
    "/path/to/tcvis-parent-directory",
)

# Now we can for example match the resolution and extent of the optical data:
tcvis = tcvis.odc.reproject(optical.odc.geobox, resampling="cubic")
Source code in darts-acquisition/src/darts_acquisition/tcvis.py
def load_tcvis(
    geobox: GeoBox,
    data_dir: Path | str,
    buffer: int = 0,
    persist: bool = True,
) -> xr.Dataset:
    """Load the TCVIS for the given geobox, fetch new data from GEE if necessary.

    Args:
        geobox (GeoBox): The geobox to load the data for.
        data_dir (Path | str): The directory to store the downloaded data for faster access for consecutive calls.
        buffer (int, optional): The buffer around the geobox in pixels. Defaults to 0.
        persist (bool, optional): If the data should be persisted in memory.
            If not, this will return a Dask backed Dataset. Defaults to True.

    Returns:
        xr.Dataset: The TCVIS dataset.

    Usage:
        Since the API of the `load_tcvis` is based on GeoBox, one can load a specific ROI based on an existing Xarray DataArray:

        ```python
        import xarray as xr
        import odc.geo.xr

        from darts_aquisition import load_tcvis

        # Assume "optical" is an already loaded s2 based dataarray

        tcvis = load_tcvis(
            optical.odc.geobox,
            "/path/to/tcvis-parent-directory",
        )

        # Now we can for example match the resolution and extent of the optical data:
        tcvis = tcvis.odc.reproject(optical.odc.geobox, resampling="cubic")
        ```

    """  # noqa: E501
    tick_fstart = time.perf_counter()

    data_dir = Path(data_dir) if isinstance(data_dir, str) else data_dir

    datacube_fpath = data_dir / "tcvis_2000-2019.zarr"
    storage = zarr.storage.FSStore(datacube_fpath)
    logger.debug(f"Loading TCVis from {datacube_fpath.resolve()}")

    if not datacube_fpath.exists():
        logger.debug(f"Creating a new zarr datacube at {datacube_fpath.resolve()} with {storage=}")
        create_empty_datacube(
            title="Landsat Trends TCVIS 2000-2019",
            storage=storage,
            geobox=DATA_EXTENT,
            chunk_size=CHUNK_SIZE,
            data_vars=DATA_VARS,
            meta=DATA_VARS_META,
            var_encoding=DATA_VARS_ENCODING,
        )

    # Download the adjacent tiles (if necessary)
    reference_geobox = geobox.to_crs("epsg:4326", resolution=DATA_EXTENT.resolution.x).pad(buffer)
    with download_lock:
        procedural_download_datacube(storage, reference_geobox)

    # Load the datacube and set the spatial_ref since it is set as a coordinate within the zarr format
    chunks = None if persist else "auto"
    tcvis_datacube = xr.open_zarr(storage, mask_and_scale=False, chunks=chunks).set_coords("spatial_ref")

    # Get an AOI slice of the datacube
    tcvis_aoi = tcvis_datacube.odc.crop(reference_geobox.extent, apply_mask=False)

    # The following code would load the lazy zarr data from disk into memory
    if persist:
        tick_sload = time.perf_counter()
        tcvis_aoi = tcvis_aoi.load()
        tick_eload = time.perf_counter()
        logger.debug(f"TCVIS AOI loaded from disk in {tick_eload - tick_sload:.2f} seconds")

    logger.info(
        f"TCVIS tile {'loaded' if persist else 'lazy-opened'} in {time.perf_counter() - tick_fstart:.2f} seconds"
    )
    return tcvis_aoi

Why Zarr Datacubes?

Zarr is a file format for storing chunked, compressed, N-dimensional arrays. It is designed to store large arrays of data, and to facilitate fast and efficient IO. Zarr works well integrated with Dask and Xarray.

By storing the auxiliary data in Zarr Datacubes, it is much easier and faster to access the data of interest. If we would use GeoTiffs, we would have to first create a Cloud-Optimized GeoTiff (COG), which is basically an ensemble (mosaic) of multiple GeoTiffs. Then we would have to read from the COG, which behind the scenes would open multiple GeoTiffs and crops them to fit the region of interest. E.g. Opening a specific region of interest 10km x 10km from a 2m resolution COG would take up to 2 minutes, if the COGs extend is panarctic. Opening the same region from a Zarr Datacube takes less than 1 second.

Inspiration

This implementation and concept is heavily inspired by EarthMovers implementation of serverless datacube generation.

Procedural download

Info

The currently used auxiliary data is downloaded on demand, only data actually used is downloaded and stored on your local machine. Hence, the stored datacubes can be thought of as a cache, which is filled with data as needed.

There are currently two implementations of the procedural download used: a cloud based STAC download and a download via Google Earth-Engine.

Because the single tiles of the STAC mosaic can be overlapping and intersect with multiple Zarr chunks, the STAC download is slightly more complicated. Since Google Earth-Engine allows for exact geoboxes, download of the exact chunks is possible. This reduces the complexity of the download.

STAC GEE
1. ROI STAC 1. ROI download GEE 1. ROI download
2. ROI STAC 2. ROI download GEE 2. ROI download

The above graphics shows the difference between loading data from STAC (left) and Google Earth-Engine (right). With the STAC download, the data is downloaded from a mosaic of tiles, which can be overlapping with each other and cover multiple Zarr chunks. It may occur that a chunk is not fully covered by the STAC mosaic, which results in only partial loaded chunks. In such cases, the missing data in these chunks will be updated if the other intersecting tile is downloaded, which may occur to a later time if a connected ROI is requested. The download process is much easier for GEE, since one can request the exact geoboxes of the Zarr chunks and GEE will handle the rest. Hence, chunks will always be fully covered by the downloaded data.

Regarding the open ROI process, both implementations follow the same principle:

  1. Check which Tiles / Chunks intersect with the region of interest
  2. Dowload all new Tiles / Chunks
  3. Store the new Tiles / Chunks in their specific Zarr chunks
  4. Return the region of interest of the Zarr Datacube

STAC download

ArcticDEM STAC procedural download

Google Earth-Engine download

TCVIS Google Earth-Engine procedural download