Dataset Contribution 🗂️
=======================

Overview
--------

This page describes how to add a new dataset to scTimeBench. Dataset contributions usually require three parts:

1. a dataset loader in the dataset registry,
2. a default dataset configuration, and
3. inclusion in the supported metric groups.

Steps for Contribution
----------------------

1. Format the dataset
~~~~~~~~~~~~~~~~~~~~~

Ensure the source data can be loaded into an AnnData object and that the following observation columns are available after loading:

* cell type, mapped to ``ObservationColumns.CELL_TYPE``
* timepoint, mapped to ``ObservationColumns.TIMEPOINT``

If a dataset does not contain one of these fields, follow the existing dataset loaders and provide a sensible placeholder or derived value.

2. Add a dataset loader
~~~~~~~~~~~~~~~~~~~~~~~

Create a loader in ``src/scTimeBench/shared/dataset/registry/`` and follow the existing naming convention: the class name should end in ``Dataset``, and the module name should be in snake case.

A minimal loader looks like this:

.. code-block:: python

    """
    Dataset name.
    """

    import scanpy as sc

    from scTimeBench.shared.dataset.base import BaseDataset, ObservationColumns


    class ExampleDataset(BaseDataset):
        def _load_data(self):
            """Load the dataset into self.data."""
            data_path = self.dataset_dict["data_path"]
            self.data = sc.read_h5ad(data_path)
            # Rename the source columns to the standard observation columns.
            # Replace "cell_type" and "timepoint" with the names used in your file.
            self.data.obs = self.data.obs.rename(
                columns={
                    "cell_type": ObservationColumns.CELL_TYPE.value,
                    "timepoint": ObservationColumns.TIMEPOINT.value,
                }
            )

Existing loaders in ``src/scTimeBench/shared/dataset/registry/`` show the expected patterns for datasets with and without explicit cell-type labels.

If your dataset has no cell-type labels, set all cells to ``unknown``:

.. code-block:: python

    self.data.obs[ObservationColumns.CELL_TYPE.value] = "unknown"

If your dataset does not have timepoint labels, generate pseudotime labels as an alternative.
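As a rough illustration of the pseudotime fallback, a continuous pseudotime score can be turned into discrete timepoint labels by equal-width binning. This is only a sketch under stated assumptions, not scTimeBench's own procedure: ``bin_pseudotime`` and ``n_bins`` are hypothetical names, and how the pseudotime values themselves are computed is left to the tool of your choice.

```python
from typing import List, Sequence


def bin_pseudotime(values: Sequence[float], n_bins: int) -> List[int]:
    """Map continuous pseudotime values to integer bins 0..n_bins-1.

    Hypothetical helper for illustration; not part of scTimeBench.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    if len(values) == 0:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # All cells share one pseudotime value, so they share one timepoint.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall into bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Example pseudotime scores (made-up values).
pseudotime = [0.0, 0.1, 0.45, 0.5, 0.99, 1.0]
timepoints = bin_pseudotime(pseudotime, n_bins=4)
```

Inside ``_load_data``, labels produced this way could then be assigned to ``self.data.obs[ObservationColumns.TIMEPOINT.value]``, in the same manner as the ``unknown`` cell-type placeholder.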
Remember to export the new class from ``src/scTimeBench/shared/dataset/registry/__init__.py``.

3. Add a default dataset entry
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Register the dataset in ``src/scTimeBench/shared/dataset/default_datasets.yaml`` so that it can be loaded through a dataset tag. The preprocessing steps are executed in order. Typical steps include lineage filtering, pseudotime inference, timepoint rounding, log-normalization, and the final train/test split.

.. code-block:: yaml

    datasets:
      - name: GarciaAlonsoDataset
        tag: defaultGarciaAlonso
        data_path: ./data/garcia-alonso/human_germ.h5ad
        data_preprocessing_steps:
          - name: LineageDatasetFilter
            cell_lineage_file: ./cell_lineages/germ/cell_line.txt
            cell_equivalence_file: ./cell_lineages/germ/equal_names.txt
          - name: RoundCellsToTimepoint
            min_cells_per_timepoint: 10
          - name: LogNormPreprocessor
          - name: CopyTrainTest

If the dataset is used only for optional ontology-based workflows, add a matching entry in ``src/scTimeBench/shared/dataset/optional_datasets.yaml``.

4. Add the dataset to supported metric groups
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Update the metric defaults so the new tag is discoverable by the relevant metric families. The current metric groups are defined in:

* ``src/scTimeBench/metrics/embeddings/base.py``
* ``src/scTimeBench/metrics/ontology_based/base.py``
* ``src/scTimeBench/metrics/gex_prediction/base.py``

If the new dataset belongs to a group, add its tag to the matching dataset list in ``src/scTimeBench/shared/dataset/default_datasets.yaml`` and ensure the metric subclass supports the dataset class name.

5. Upload the data
~~~~~~~~~~~~~~~~~~

Upload the dataset to a file hosting service such as Google Drive, Zenodo, or Kaggle. This makes it easier for us to include your contribution when we update our Zenodo data release.

6. Open a pull request
~~~~~~~~~~~~~~~~~~~~~~

After the loader, configuration, and data references are in place, open a pull request with a clear description of:

* the dataset source,
* the preprocessing applied,
* any caveats or missing annotations, and
* the intended use cases.

Checklist
---------

* dataset loader added under ``src/scTimeBench/shared/dataset/registry/``
* dataset exported from ``registry/__init__.py``
* default dataset tag added to the appropriate YAML file
* metric group defaults updated, if needed
* data uploaded and linked in the pull request
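Some of these items can be checked mechanically before the pull request is opened. The sketch below inspects a single dataset entry, written as a plain Python dictionary, for the keys that appear in the ``GarciaAlonsoDataset`` example and for the ``Dataset`` class-name suffix. Treating those keys as required is an assumption drawn from that example, ``check_dataset_entry`` is a hypothetical helper rather than part of scTimeBench, and the ``ExampleDataset`` entry with its path is made up for illustration.

```python
from typing import Any, Dict, List

# Keys seen in the example YAML entry; treating them as required is an assumption.
ASSUMED_REQUIRED_KEYS = ("name", "tag", "data_path", "data_preprocessing_steps")


def check_dataset_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one dataset entry (empty if none)."""
    problems = []
    for key in ASSUMED_REQUIRED_KEYS:
        if key not in entry:
            problems.append(f"missing key: {key}")
    name = entry.get("name", "")
    # Naming convention from the loader step: class names end in "Dataset".
    if name and not name.endswith("Dataset"):
        problems.append("class name should end in 'Dataset'")
    for i, step in enumerate(entry.get("data_preprocessing_steps", [])):
        if "name" not in step:
            problems.append(f"preprocessing step {i} has no name")
    return problems


# Hypothetical entry mirroring the structure of the YAML example.
entry = {
    "name": "ExampleDataset",
    "tag": "defaultExample",
    "data_path": "./data/example/example.h5ad",
    "data_preprocessing_steps": [
        {"name": "LogNormPreprocessor"},
        {"name": "CopyTrainTest"},
    ],
}
problems = check_dataset_entry(entry)
```

An empty ``problems`` list only means the entry has the expected shape. It does not confirm that the loader runs or that the preprocessing step names exist in scTimeBench.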