Dataset Contribution 🗂️

Overview

This page describes how to add a new dataset to scTimeBench.

Dataset contributions usually require three parts:

  1. a dataset loader in the dataset registry,

  2. a default dataset configuration, and

  3. inclusion in the supported metric groups.

Steps for Contribution

1. Format the dataset

Ensure the source data can be loaded into an AnnData object and that the following observation columns are available after loading:

  • cell type, mapped to ObservationColumns.CELL_TYPE

  • timepoint, mapped to ObservationColumns.TIMEPOINT

If a dataset does not contain one of these fields, follow the existing dataset loaders and provide a sensible placeholder or derived value.

2. Add a dataset loader

Create a loader in src/scTimeBench/shared/dataset/registry/ and follow the existing naming convention: the class name should end in Dataset, and the module name should be snake case.

A minimal loader looks like this:

"""
Dataset name.
"""

import scanpy as sc

from scTimeBench.shared.dataset.base import BaseDataset, ObservationColumns


class ExampleDataset(BaseDataset):
    def _load_data(self):
        """Load the dataset into self.data."""
        data_path = self.dataset_dict["data_path"]
        self.data = sc.read_h5ad(data_path)

        self.data.obs = self.data.obs.rename(
            columns={
                "cell_type": ObservationColumns.CELL_TYPE.value,
                "timepoint": ObservationColumns.TIMEPOINT.value,
            }
        )

Existing loaders in src/scTimeBench/shared/dataset/registry/ show the expected patterns for datasets with and without explicit cell-type labels.

If your dataset has no cell-type labels, set all cells to unknown:

self.data.obs[ObservationColumns.CELL_TYPE.value] = "unknown"

If your dataset does not have timepoint labels, generate pseudotime labels as an alternative.

Remember to export the new class from src/scTimeBench/shared/dataset/registry/__init__.py.

3. Add a default dataset entry

Register the dataset in src/scTimeBench/shared/dataset/default_datasets.yaml so that it can be loaded through a dataset tag.

The preprocessing steps are executed in order. Typical steps include lineage filtering, pseudotime inference, timepoint rounding, log-normalization, and the final train/test split.

datasets:
  - name: GarciaAlonsoDataset
    tag: defaultGarciaAlonso
    data_path: ./data/garcia-alonso/human_germ.h5ad
    data_preprocessing_steps:
      - name: LineageDatasetFilter
        cell_lineage_file: ./cell_lineages/germ/cell_line.txt
        cell_equivalence_file: ./cell_lineages/germ/equal_names.txt
      - name: RoundCellsToTimepoint
        min_cells_per_timepoint: 10
      - name: LogNormPreprocessor
      - name: CopyTrainTest

If the dataset is used only for optional ontology-based workflows, add a matching entry in src/scTimeBench/shared/dataset/optional_datasets.yaml.

4. Add the dataset to supported metric groups

Update the metric defaults so the new tag is discoverable by the relevant metric families. The current metric groups are defined in:

If the new dataset belongs to a group, add its tag to the matching dataset list in src/scTimeBench/shared/dataset/default_datasets.yaml and ensure the metric subclass supports the dataset class name.

5. Upload the data

Upload the dataset to a file hosting service such as Google Drive, Zenodo or Kaggle. This will facilitate our ability to update our Zenodo data release with your contributions.

6. Open a pull request

After the loader, configuration, and data references are in place, open a pull request with a clear description of:

  • the dataset source,

  • the preprocessing applied,

  • any caveats or missing annotations, and

  • the intended use cases.

Checklist

  • dataset loader added under src/scTimeBench/shared/dataset/registry/

  • dataset exported from registry/__init__.py

  • default dataset tag added to the appropriate YAML file

  • metric group defaults updated, if needed

  • data uploaded and linked in the pull request