Dataset Contribution 🗂️

Overview

This page describes how to add a new dataset to scTimeBench.

Dataset contributions usually require three parts:

a dataset loader in the dataset registry,
a default dataset configuration, and
inclusion in the supported metric groups.

Steps for Contribution

1. Format the dataset

Ensure the source data can be loaded into an AnnData object and that the following observation columns are available after loading:

cell type, mapped to ObservationColumns.CELL_TYPE

timepoint, mapped to ObservationColumns.TIMEPOINT

If a dataset does not contain one of these fields, follow the existing dataset loaders and provide a sensible placeholder or derived value.
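
Before writing the loader, it can help to confirm which of the required columns are already present. The following is a minimal sketch, not part of scTimeBench: the helper name missing_obs_columns is hypothetical, and the column names "cell_type" and "timepoint" are placeholders for ObservationColumns.CELL_TYPE.value and ObservationColumns.TIMEPOINT.value, whose actual values should be taken from the package.

```python
from typing import Iterable, List

# Placeholder names (assumption). In scTimeBench the real names come from
# ObservationColumns.CELL_TYPE.value and ObservationColumns.TIMEPOINT.value.
REQUIRED_OBS_COLUMNS = ("cell_type", "timepoint")


def missing_obs_columns(
    obs_columns: Iterable[str],
    required: Iterable[str] = REQUIRED_OBS_COLUMNS,
) -> List[str]:
    """Return the required column names that are absent, in the order given."""
    present = set(obs_columns)
    return [name for name in required if name not in present]
```

With an AnnData object loaded as adata, calling missing_obs_columns(adata.obs.columns) would list the columns that still need a placeholder or derived value.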

2. Add a dataset loader

Create a loader in src/scTimeBench/shared/dataset/registry/ and follow the existing naming convention: the class name should end in Dataset, and the module name should be in snake_case.
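
The naming convention can be checked mechanically. This is a small sketch under stated assumptions: follows_loader_naming is a hypothetical helper, not an scTimeBench function, and the regular expression encodes one common reading of "snake case" (lowercase letters and digits separated by single underscores). The module name is passed without the .py extension.

```python
import re

# One common definition of snake_case (assumption): lowercase words and
# digits joined by single underscores, starting with a letter.
_SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def follows_loader_naming(class_name: str, module_name: str) -> bool:
    """Check that the class ends in 'Dataset' and the module is snake_case."""
    class_ok = class_name.endswith("Dataset") and class_name != "Dataset"
    module_ok = _SNAKE_CASE.match(module_name) is not None
    return class_ok and module_ok
```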

A minimal loader looks like this:
"""
Dataset name.
"""

import scanpy as sc

from scTimeBench.shared.dataset.base import BaseDataset, ObservationColumns


class ExampleDataset(BaseDataset):
    def _load_data(self):
        """Load the dataset into self.data."""
        data_path = self.dataset_dict["data_path"]
        self.data = sc.read_h5ad(data_path)

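        # Rename the source columns to the names scTimeBench expects.
        # Adjust the keys to match the column names in your file.
        self.data.obs = self.data.obs.rename(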
            columns={
                "cell_type": ObservationColumns.CELL_TYPE.value,
                "timepoint": ObservationColumns.TIMEPOINT.value,
            }
        )
Existing loaders in src/scTimeBench/shared/dataset/registry/ show the expected patterns for datasets with and without explicit cell-type labels.

If your dataset has no cell-type labels, set all cells to unknown:
self.data.obs[ObservationColumns.CELL_TYPE.value] = "unknown"
If your dataset does not have timepoint labels, generate pseudotime labels as an alternative.

Remember to export the new class from src/scTimeBench/shared/dataset/registry/__init__.py.

3. Add a default dataset entry

Register the dataset in src/scTimeBench/shared/dataset/default_datasets.yaml so that it can be loaded through a dataset tag.

The entries under data_preprocessing_steps are executed in the order listed. Typical steps include lineage filtering, pseudotime inference, timepoint rounding, log-normalization, and the final train/test split.
datasets:
  - name: GarciaAlonsoDataset
    tag: defaultGarciaAlonso
    data_path: ./data/garcia-alonso/human_germ.h5ad
    data_preprocessing_steps:
      - name: LineageDatasetFilter
        cell_lineage_file: ./cell_lineages/germ/cell_line.txt
        cell_equivalence_file: ./cell_lineages/germ/equal_names.txt
      - name: RoundCellsToTimepoint
        min_cells_per_timepoint: 10
      - name: LogNormPreprocessor
      - name: CopyTrainTest
If the dataset is used only for optional ontology-based workflows, add a matching entry in src/scTimeBench/shared/dataset/optional_datasets.yaml.
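
A new entry can be sanity-checked before it goes into the YAML file. The sketch below works on a plain dictionary that mirrors the YAML structure shown above. Several things here are assumptions rather than documented scTimeBench behavior: the helper entry_problems is hypothetical, the list of required keys is inferred from the GarciaAlonsoDataset example, tags are assumed to be unique, and the ExampleDataset name, tag, and path are placeholders.

```python
from typing import Any, Dict, List

# Keys inferred from the example entry above (assumption, not a schema).
REQUIRED_KEYS = ("name", "tag", "data_path", "data_preprocessing_steps")


def entry_problems(entry: Dict[str, Any], existing_tags: List[str]) -> List[str]:
    """Return human-readable problems found in one dataset entry."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in entry:
            problems.append(f"missing key: {key}")
    # Assumption: a tag identifies exactly one dataset configuration.
    if entry.get("tag") in existing_tags:
        problems.append(f"duplicate tag: {entry['tag']}")
    # Every preprocessing step needs at least a name.
    for i, step in enumerate(entry.get("data_preprocessing_steps", [])):
        if not isinstance(step, dict) or "name" not in step:
            problems.append(f"step {i} has no name")
    return problems


# Hypothetical entry; the step names are taken from the example above.
example_entry = {
    "name": "ExampleDataset",
    "tag": "defaultExample",
    "data_path": "./data/example/example.h5ad",
    "data_preprocessing_steps": [
        {"name": "RoundCellsToTimepoint", "min_cells_per_timepoint": 10},
        {"name": "LogNormPreprocessor"},
        {"name": "CopyTrainTest"},
    ],
}
```

Calling entry_problems(example_entry, ["defaultGarciaAlonso"]) returns an empty list when nothing is wrong. Because the steps are kept in a list, their order in the dictionary matches the order in which they would run.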

4. Add the dataset to supported metric groups

Update the metric defaults so the new tag is discoverable by the relevant metric families. The current metric groups are defined in:

src/scTimeBench/metrics/embeddings/base.py

src/scTimeBench/metrics/ontology_based/base.py

src/scTimeBench/metrics/gex_prediction/base.py

If the new dataset belongs to a group, add its tag to the matching dataset list in src/scTimeBench/shared/dataset/default_datasets.yaml and ensure the metric subclass supports the dataset class name.

5. Upload the data

Upload the dataset to a file hosting service such as Google Drive, Zenodo, or Kaggle, and link it in the pull request. This makes it easier for us to add your contribution to our Zenodo data release.

6. Open a pull request

After the loader, configuration, and data references are in place, open a pull request with a clear description of:

the dataset source,

the preprocessing applied,

any caveats or missing annotations, and

the intended use cases.

Checklist

dataset loader added under src/scTimeBench/shared/dataset/registry/
dataset exported from registry/__init__.py
default dataset tag added to the appropriate YAML file
metric group defaults updated, if needed
data uploaded and linked in the pull request
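
The file-based items in the checklist can be verified with a short script before opening the pull request. This is a rough sketch, not a tool shipped with scTimeBench: contribution_gaps is a hypothetical helper, it uses only the paths named on this page, and it relies on plain substring matching, so it can confirm that a class name or tag appears in a file but not that it is used correctly.

```python
from pathlib import Path
from typing import List


def contribution_gaps(
    repo_root: Path, module_name: str, class_name: str, tag: str
) -> List[str]:
    """Return the file-based checklist items that still look incomplete."""
    dataset_dir = repo_root / "src" / "scTimeBench" / "shared" / "dataset"
    registry = dataset_dir / "registry"
    gaps = []

    # 1. The loader module exists in the registry.
    if not (registry / f"{module_name}.py").is_file():
        gaps.append("loader module not found in registry/")

    # 2. The class name is mentioned in registry/__init__.py.
    init_file = registry / "__init__.py"
    if not init_file.is_file() or class_name not in init_file.read_text(encoding="utf-8"):
        gaps.append("class not exported from registry/__init__.py")

    # 3. The tag appears in one of the dataset YAML files.
    yaml_files = [
        dataset_dir / "default_datasets.yaml",
        dataset_dir / "optional_datasets.yaml",
    ]
    tag_found = any(
        f.is_file() and tag in f.read_text(encoding="utf-8") for f in yaml_files
    )
    if not tag_found:
        gaps.append("tag not found in default or optional dataset YAML")

    return gaps
```

For example, contribution_gaps(Path("."), "example_dataset", "ExampleDataset", "defaultExample") run from the repository root returns an empty list once all three items are in place. The module name, class name, and tag in this call are placeholders.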