data_validate
darts_segmentation.training.data_validate
CLI-ready function for validating a training dataset config against a training dataset.
validate_dataset
validate_dataset(
    train_data_dir: str | pathlib.Path,
    data_split_method: typing.Literal["random", "region", "sample"] | None = None,
    data_split_by: list[str | float] | None = None,
    fold_method: typing.Literal[
        "kfold", "shuffle", "stratified", "region", "region-stratified"
    ] | None = "kfold",
    total_folds: int = 5,
    subsample: int | None = None,
    bands: list[str] | None = None,
)
Validate a training dataset config against a training dataset.
See the DartsDataModule for more information.
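A minimal sketch of calling the function from Python, assuming the package is installed and the directory follows the layout described for train_data_dir. The helpers `missing_layout_entries` and `run_validation` are hypothetical and not part of the library; the first only checks that the two expected entries exist, so obvious layout problems surface before the import and the call.

```python
from __future__ import annotations

from pathlib import Path


def missing_layout_entries(train_data_dir: str | Path) -> list[str]:
    """Return the expected top-level entries that are absent (hypothetical helper)."""
    root = Path(train_data_dir)
    expected = ["data.zarr", "metadata.parquet"]
    return [name for name in expected if not (root / name).exists()]


def run_validation(train_data_dir: str | Path) -> None:
    """Check the directory layout, then call validate_dataset (hypothetical wrapper)."""
    missing = missing_layout_entries(train_data_dir)
    if missing:
        raise FileNotFoundError(f"{train_data_dir} is missing: {', '.join(missing)}")

    # Imported lazily so the layout check also works without the package installed.
    from darts_segmentation.training.data_validate import validate_dataset

    validate_dataset(
        train_data_dir=train_data_dir,
        data_split_method="random",
        data_split_by=[0.2],  # hold out 20% of the data for testing
        fold_method="kfold",
        total_folds=5,
    )
```

Calling `run_validation("./train_data")` (a placeholder path) would then validate a random 80/20 split with 5-fold cross-validation on the training part.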
Parameters:
-
train_data_dir
(str | pathlib.Path
) –The path to the data to be used for training. Expects a directory containing:
    1. a zarr group called "data.zarr" containing an "x" and a "y" array
    2. a geoparquet file called "metadata.parquet" containing the metadata for the data. This metadata should contain at least the following columns:
        - "sample_id": The id of the sample
        - "region": The region the sample belongs to
        - "empty": Whether the image is empty
    The index should refer to the index of the sample in the zarr data. This directory should be created by a preprocessing script.
-
data_split_method
(typing.Literal['random', 'region', 'sample'] | None
, default:None
) –The method to use for splitting the data into a train and a test set.
    "random" splits the data randomly; the seed is always 42 and the test size can be specified by providing a list with a single float between 0 and 1 to data_split_by. This is the fraction of the data used for testing, e.g. [0.2] uses 20% of the data for testing.
    "region" splits the data by one or multiple regions, which can be specified by providing a str or list of str to data_split_by.
    "sample" splits the data by sample ids, which are specified in the same way as for "region".
    If None, no split is done and the complete dataset is used for both training and testing.
    The train split is further split in the cross-validation process. Defaults to None.
-
data_split_by
(list[str | float] | None
, default:None
) –The regions or sample ids to split by, or the size of the test set as a fraction between 0 and 1. Defaults to None.
-
fold_method
(typing.Literal['kfold', 'shuffle', 'stratified', 'region', 'region-stratified'] | None
, default:'kfold'
) –Method for cross-validation split. Defaults to "kfold".
-
total_folds
(int
, default:5
) –Total number of folds in cross-validation. Defaults to 5.
-
subsample
(int | None
, default:None
) –If set, will subsample the dataset to this number of samples. This is useful for debugging and testing. Defaults to None.
-
bands
(Bands | list[str] | None
, default:None
) –List of bands to use. Expects the data_dir to contain a config.toml with a "darts.bands" key, which is used to map the selected bands to their indices. Defaults to None.
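The three data_split_method options and the None case can be illustrated with a small standalone sketch. This is not the library's implementation: `split_records` is a hypothetical helper working on plain dicts, the rounding of the test size is an assumption, and so is the choice that the listed regions or sample ids form the test set. Only the fixed seed of 42 and the meaning of data_split_by are taken from the parameter descriptions.

```python
from __future__ import annotations

import random


def split_records(records, method=None, split_by=None):
    """Split dicts with "sample_id" and "region" keys into (train, test)."""
    if method is None:
        # No split: the complete dataset is used for both training and testing.
        return list(records), list(records)
    if not split_by:
        raise ValueError("split_by is required when a method is given")

    if method == "random":
        test_fraction = float(split_by[0])
        shuffled = list(records)
        random.Random(42).shuffle(shuffled)  # fixed seed, as documented
        n_test = round(len(shuffled) * test_fraction)  # rounding rule is an assumption
        return shuffled[n_test:], shuffled[:n_test]

    if method in ("region", "sample"):
        key = "region" if method == "region" else "sample_id"
        held_out = set(split_by)  # assumption: listed values form the test set
        train = [r for r in records if r[key] not in held_out]
        test = [r for r in records if r[key] in held_out]
        return train, test

    raise ValueError(f"unknown method: {method!r}")


# Ten illustrative samples: six in region "A", four in region "B".
records = [
    {"sample_id": f"s{i}", "region": "A" if i < 6 else "B"} for i in range(10)
]

train, test = split_records(records, "random", [0.2])  # 8 train, 2 test
train_b, test_b = split_records(records, "region", ["B"])  # region "B" held out
train_s, test_s = split_records(records, "sample", ["s0", "s1"])  # two ids held out
```

In each case the returned train part is what would then be divided further into total_folds folds according to fold_method.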