
Band / Modalities and Normalisation

In the training dataset preparation, all bands available from the preprocessing are included in the training dataset. They are normalised and clipped to the range [0, 1] using hard-coded normalisation factors and offsets. When training, a subset of the available bands can be selected on which the model should be trained. This makes it possible to quickly test different band combinations and their influence on the model performance without having to preprocess the data again.

The information about which bands were used for training is then written into the model checkpoint. This information is later used to select the bands for inference.
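A minimal sketch of how this can look; the checkpoint field bands, the band names, and the helper functions are illustrative assumptions, not the actual API:

import torch
import xarray as xr

TRAINING_BANDS = ["red", "green", "blue", "nir", "ndvi"]  # hypothetical subset

def save_checkpoint(model: torch.nn.Module, path: str) -> None:
    # Store the band subset next to the weights so inference can reproduce it.
    torch.save({"state_dict": model.state_dict(), "bands": TRAINING_BANDS}, path)

def select_bands_for_inference(ds: xr.Dataset, path: str) -> xr.Dataset:
    # Read the band list back from the checkpoint and subset the dataset.
    ckpt = torch.load(path, map_location="cpu")
    return ds[ckpt["bands"]]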

Representation states

The data representation is split into three different states:

  • Disk: Data is stored in the most efficient way, e.g. uint16 for Sentinel 2, uint8 for TCVIS
  • Memory: Data is stored in the most convenient and correct way for visualisation purposes. This should be equal to the original data representation. E.g. [-1, 1] float for NDVI
  • Model: Data is normalised to [0, 1] for training and inference and is always float32

Memory representation exceptions

The memory representation is not always the same as the original data representation. For example, satellite data like Sentinel-2 is originally stored as uint16; however, we also want to account for NaN values in the data. Therefore, the memory representation of Sentinel-2 data is float32 instead, but with the same range as the original data and 0 replaced with NaN.
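A minimal sketch of this exception as described above; the function name is illustrative, and in the pipeline this is handled by the CF decoding described further below:

import xarray as xr

def s2_disk_to_memory(band: xr.DataArray) -> xr.DataArray:
    """Sentinel-2 disk (uint16, 0 = no-data) -> memory (float32, 0 replaced by NaN)."""
    decoded = band.astype("float32")
    return decoded.where(decoded != 0)  # where() fills the masked positions with NaN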

Specification Details

The data conversion happens at three different places in the pipeline:

  • Disk -> Memory:
      • In the Cache-Manager, where data is loaded from a cache NetCDF file into memory. This is handled via xarray, which follows the CF conventions.
      • In the Training-Preprocessors.
      • NOT in the export -> the export is handled manually.
  • Memory -> Model: In the segmentation module, right before the data is transformed into PyTorch tensors for inference (see the sketch after this list). For training, this is done before writing the data into the training dataset to save compute power and enable better caching.
  • Model -> Memory: Never happens, the output probabilities are exported as is. Further, the probabilities are only appended to the data in memory-representation. Therefore the model-representation only ever exists in the inference or training code.
  • Memory -> Disk: In the export module and in the Cache-Manager when writing cache files. This is done via xarray, which follows the CF conventions.
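A minimal sketch of the Memory -> Model step, assuming a simple linear rescale of a band's valid range to [0, 1] with clipping; the function and parameter names are illustrative, not the actual segmentation-module API:

import torch
import xarray as xr

def memory_to_model(band: xr.DataArray, valid_min: float, valid_max: float) -> torch.Tensor:
    """Normalise a band from its memory representation to the model range [0, 1]."""
    scaled = (band - valid_min) / (valid_max - valid_min)
    clipped = scaled.clip(0.0, 1.0)
    # No-data (NaN in memory) is filled with 0 here purely for illustration.
    return torch.from_numpy(clipped.fillna(0.0).values.astype("float32"))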

Terminology

Because the data can have three different representations, it can become unclear what is meant by "encoded" and "decoded". In general, the "Memory" representation is always the "true" and therefore "decoded" representation. However, outside of the context of conversions, we use the following terminology:

  • Encoded: The data in the representation that is used for caching and exports, i.e. disk-representation.
  • Decoded: The data in the representation that is used for working and visualisation, i.e. memory-representation.
  • Normalised: The data in the representation that is used for training and inference, i.e. model-representation.

| DataVariable | usage | shape | dtype (memory) | dtype (disk) | valid-range | disk-range | no-data (disk) | attrs | source | note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| blue | inp | (x, y) | float32 | uint16 | [-0.1, 0.5] | [0, 65535] | 0 | data_source, long_name, units | PLANET / S2 | |
| green | inp | (x, y) | float32 | uint16 | [-0.1, 0.5] | [0, 65535] | 0 | data_source, long_name, units | PLANET / S2 | |
| red | inp | (x, y) | float32 | uint16 | [-0.1, 0.5] | [0, 65535] | 0 | data_source, long_name, units | PLANET / S2 | |
| nir | inp | (x, y) | float32 | uint16 | [-0.1, 0.5] | [0, 65535] | 0 | data_source, long_name, units | PLANET / S2 | |
| s2_scl | qal | (x, y) | uint8 | uint8 | [0, 11] | [0, 11] | - | data_source, long_name | S2 | https://custom-scripts.sentinel-hub.com/custom-scripts/sentinel-2/scene-classification/ |
| planet_udm | qal | (x, y) | uint8 | uint8 | [0, 8] | [0, 8] | - | | PLANET | https://docs.planet.com/data/imagery/udm/ |
| quality_data_mask | qal | (x, y) | uint8 | uint8 | {0, 1, 2} | {0, 1, 2} | - | data_source, long_name, description | Acquisition | 0 = Invalid, 1 = Low Quality, 2 = High Quality |
| dem | inp | (x, y) | float32 | int16 | [-100, 3000] | [0, 31000] | -1 | data_source, long_name, units | SmartGeocubes | |
| arcticdem_data_mask | qal | (x, y) | uint8 | bool | {0, 1} | {False, True} | - | data_source, long_name, units | SmartGeocubes | |
| tc_brightness | inp | (x, y) | uint8 | uint8 | [0, 255] | [0, 255] | - | data_source, long_name | EarthEngine | |
| tc_greenness | inp | (x, y) | uint8 | uint8 | [0, 255] | [0, 255] | - | data_source, long_name | EarthEngine | |
| tc_wetness | inp | (x, y) | uint8 | uint8 | [0, 255] | [0, 255] | - | data_source, long_name | EarthEngine | |
| ndvi | inp | (x, y) | float32 | int16 | [-1, 1] | [0, 20000] | -1 | long_name | Preprocessing | |
| relative_elevation | inp | (x, y) | float32 | int16 | [-50, 50] | [0, 30000] | -1 | data_source, long_name, units | Preprocessing | |
| slope | inp | (x, y) | float32 | int16 | [0, 90] | [0, 9000] | -1 | data_source, long_name | Preprocessing | |
| aspect | inp | (x, y) | float32 | int16 | [0, 360] | [0, 3600] | -1 | data_source, long_name | Preprocessing | |
| hillshade | inp | (x, y) | float32 | int16 | [0, 1] | [0, 10000] | -1 | data_source, long_name | Preprocessing | |
| curvature | inp | (x, y) | float32 | int16 | [-1, 1] | [0, 20000] | -1 | data_source, long_name | Preprocessing | |
| probabilities | dbg | (x, y) | float32 | uint8 | [0, 1] | [0, 100] | 255 | long_name | Ensemble / Segmentation | |
| probabilities-X* | dbg | (x, y) | float32 | uint8 | [0, 1] | [0, 100] | 255 | long_name | Ensemble / Segmentation | |
| binarized_segmentation | out | (x, y) | bool | bool | {False, True} | {False, True} | - | long_name | Postprocessing | |
| binarized_segmentation-X* | dbg | (x, y) | bool | bool | {False, True} | {False, True} | - | long_name | Postprocessing | |
| extent | out | (x, y) | bool | bool | {False, True} | {False, True} | - | long_name | Postprocessing | |

Notes:

  • X* = Model name, e.g. probabilities-tcvis, probabilities-notcvis, etc.
  • The no-data value in memory for float32 is always nan.
  • bool types before postprocessing must be represented as uint8 in memory for easy reprojection etc. (see the sketch after this list).
  • Modes of usage:
      • inp: (Potential) input to the model
      • qal: Quality assurance layer, not used as input to the model, but for masking or filtering
      • dbg: Only exported for debugging purposes
      • out: Output of the model
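A minimal sketch of the uint8 round trip for reprojection, assuming rioxarray is used; the helper and the no-data value 255 are illustrative:

import rioxarray  # noqa: F401  # registers the .rio accessor
import xarray as xr

def reproject_bool_mask(mask: xr.DataArray, target: xr.DataArray) -> xr.DataArray:
    """Reproject a boolean mask by round-tripping through uint8.

    GDAL has no boolean dtype, so the mask is cast to uint8 (255 = no-data)
    for the reprojection and converted back to bool afterwards.
    """
    as_uint8 = mask.astype("uint8")
    reprojected = as_uint8.rio.reproject_match(target, nodata=255)
    return reprojected == 1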

Incompleteness

The attrs column is outdated.

Missing:

  • New DEM Engineered - VRM DI etc.
  • New Indices - TGI, EXG, GLI etc.

Loss of Information

Because we encode almost every variable we work with into a smaller-sized representation or into a smaller range, information gets lost. E.g. when writing the DEM to disk, values larger than 3000 m are clipped to 3000 m and the minimum step size between values becomes 0.1 m. This is enough for our purposes, but may not be suitable for other applications.
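As a small worked example with the DEM codec from the table above (valid range [-100, 3000] on int16 with disk range [0, 31000], i.e. a step size of 0.1 m):

import numpy as np

valid_min, valid_max = -100.0, 3000.0
disk_min, disk_max = 0, 31000
scale = (valid_max - valid_min) / (disk_max - disk_min)  # 0.1 m per step
offset = valid_min

def encode(value):
    # Clip to the valid range, then quantise to the integer disk range.
    clipped = np.clip(value, valid_min, valid_max)
    return np.round((clipped - offset) / scale).astype("int16")

def decode(encoded):
    return encoded * scale + offset

print(decode(encode(3542.37)))  # -> 3000.0  (clipped)
print(decode(encode(123.456)))  # -> 123.5   (rounded to the 0.1 m step)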

Optical bands: PLANET vs. S2-Harmonized (GEE) vs. Sentinel-2 L2A (Copernicus)

This is complicated: in theory, the range of this data is between 0 and 1 and measured as surface reflectance. However, the values can be negative (e.g. due to atmospheric correction) and larger than 1 (e.g. due to bright surfaces like snow). Thus, Copernicus applies a shift of -0.1 to allow for negative values in the Sentinel-2 L2A product. This was introduced in 2022 - Google Earth Engine simply reverts the shift for all data after 2022 in the S2-Harmonized collection, because they never re-upload data. Once uploaded, the data is fixed and will never change. Thus, GEE S2-Harmonized data is lossy. The unharmonized dataset in GEE, which is officially deprecated, is not lossy, but spectral values are not comparable between years before and after 2022. Further, because data is never re-uploaded, the processing of older imagery differs from that of newer imagery. The data in GEE and in Copernicus is always stored as uint16 values between 0 and 65535 (the maximum of uint16) with a scale factor of 10000, just with different offsets.
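A hedged sketch of what the two conventions imply for decoding the stored digital numbers (DN) into reflectance; these helpers only encode the -0.1 shift and the 1/10000 scale factor described above and should be checked against the providers' documentation:

def copernicus_l2a_dn_to_reflectance(dn: float) -> float:
    # Post-2022 Copernicus L2A: DN 0 corresponds to -0.1 surface reflectance.
    return dn / 10000.0 - 0.1

def gee_s2_harmonized_dn_to_reflectance(dn: float) -> float:
    # GEE S2-Harmonized reverts the shift, so reflectance below 0 cannot be
    # represented anymore -- this is the lossiness mentioned above.
    return dn / 10000.0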

In our pipeline we want to be able to utilize negative values in the model. Because most viable (non-snow, non-cloud) values are not larger than 0.5, we decided to use the range [-0.1, 0.5] for the memory representation. Of course, this only applies to the normalization before the data is fed into the model; calculations of indices like NDVI are not limited to this range.

Data which is downloaded directly from either Copernicus or GEE is stored in the cache in its own representation. This is not documented in the table above and is specific to the acquisition module. All data output by the acquisition module is always converted to the memory representation.

DEM

The highest point in the Arctic is approx. 3000 m. The lowest depends on the geoid used; for ArcticDEM there are presumably very few values below -10 m. Hence, the valid-range scaling is, similar to the optical data, somewhat arbitrary.

For TPI (relative_elevation), the valid range strongly depends on the kernel used; it increases with larger kernel sizes. E.g. some tests with Sentinel-2:

  • 2px (20m) kernel: [-3, 3]
  • 10px (100m) kernel: [-40, 20]
  • 100px (1000m) kernel: [-60, 40]

Since we mostly use a kernel between 10 px and 100 px, we can expect the valid range to be within [-50, 50].
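A minimal sketch of how such a TPI-like relative elevation can be computed, assuming a simple square mean kernel via scipy.ndimage; the actual preprocessing may use a different kernel shape and edge handling:

import numpy as np
from scipy.ndimage import uniform_filter

def relative_elevation(dem: np.ndarray, kernel_px: int) -> np.ndarray:
    """TPI-like relative elevation: elevation minus the neighbourhood mean.

    Larger kernels compare each pixel against a wider neighbourhood,
    which is why the valid range grows with the kernel size.
    """
    neighbourhood_mean = uniform_filter(dem.astype("float32"), size=kernel_px)
    return dem - neighbourhood_mean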

Implementation Details

All Disk <-> Memory conversions are done via xarray through its CF convention layer (decode_cf=True). For that, the attributes _FillValue, scale_factor, and add_offset are set by a helper module darts_utils.bands.BandManager. This helper is used by the Cache-Manager and the export module to ensure that the data is always in the correct representation.
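A minimal sketch of that mechanism with plain xarray, using the NDVI codec values from the table above; the BandManager itself wraps this, and the file name is made up:

import numpy as np
import xarray as xr

# Encode: Memory (float32, [-1, 1], NaN as no-data) -> Disk (int16, [0, 20000]).
ndvi = xr.DataArray(np.array([[-1.0, 0.5], [np.nan, 1.0]], dtype="float32"), dims=("y", "x"), name="ndvi")
ndvi.encoding = {"dtype": "int16", "scale_factor": 0.0001, "add_offset": -1.0, "_FillValue": -1}
ndvi.to_netcdf("ndvi.nc")

# Decode: xarray applies scale_factor/add_offset and masks _FillValue
# automatically, because decode_cf defaults to True.
decoded = xr.open_dataarray("ndvi.nc")
print(decoded.values)  # back on the [-1, 1] scale (up to quantisation error), NaN restored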

_FillValue

With rioxarray it is possible to assign a _FillValue attribute to the data variables with .rio.write_nodata(). This can lead to weird behaviour when writing and reading the data:

>>> "Before writing with _FillValue=0.0: dtype=uint16, attrs={'_FillValue': 0.0, 'data_source': 'planet', 'long_name': 'NDVI'}"
>>> "After reading with _FillValue=0.0: dtype=float32, attrs={'data_source': 'planet', 'long_name': 'NDVI'}"
>>> "Before writing with _FillValue=0: dtype=uint16, attrs={'_FillValue': 0, 'data_source': 'planet', 'long_name': 'NDVI'}"
>>> "After reading with _FillValue=0: dtype=float32, attrs={'data_source': 'planet', 'long_name': 'NDVI'}"
>>> "Before writing wihtout _FillValue: dtype=uint16, attrs={'data_source': 'planet', 'long_name': 'NDVI'}"
>>> "After reading without _FillValue: dtype=uint16, attrs={'data_source': 'planet', 'long_name': 'NDVI'}"

Scale and Offset

The scale and offset for normalization are automatically derived from the valid-range parameter of a BandCodec.

For disk encoding the following formula can be used to derive the scale and offset manually based on the valid-range and the disk-range:

offset = valid_range.min
scale = (valid_range.max - valid_range.min) / (disk_range.max - disk_range.min)

E.g. for NDVI with valid_range=(-1.0, 1.0) and disk_range=(0, 20000):

offset = valid_range.min = -1.0
scale = (valid_range.max - valid_range.min) / (disk_range.max - disk_range.min) = 2 / 20000 = 0.0001
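The same derivation as a small helper function; the name is illustrative, in the pipeline this is derived by the BandCodec:

def derive_disk_encoding(valid_range: tuple[float, float], disk_range: tuple[int, int]) -> tuple[float, float]:
    """Derive CF add_offset and scale_factor from the valid range and the disk range."""
    offset = valid_range[0]
    scale = (valid_range[1] - valid_range[0]) / (disk_range[1] - disk_range[0])
    return offset, scale

print(derive_disk_encoding((-1.0, 1.0), (0, 20000)))  # (-1.0, 0.0001)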

Legacy support

In order to support legacy models, it is necessary to check which model version was used. For this, from now on all checkpoints get a new field model_version in their metadata. Fortunately, all previous normalizations are equal to the new ones, hence no remapping needs to be done.
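A minimal, hypothetical sketch of such a check; the checkpoint layout and the fallback value are assumptions:

import torch

def get_model_version(checkpoint_path: str) -> int:
    """Read the model_version from a checkpoint, falling back for legacy checkpoints."""
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    # Legacy checkpoints predate the field; treat them as version 0.
    return ckpt.get("metadata", {}).get("model_version", 0)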