Default Run
TODO: Update documentation for SQLite database management and CSV extraction.
Overview
This page describes the default scTimeBench benchmark flow.
In the standard setup, you choose a method in the YAML file, list the compatible metrics, and let the benchmark resolve the default datasets for those metrics. The dataset definitions, preprocessing steps, and metric-group defaults are then pulled from the shared dataset registry.
The datasets, configuration files and scripts for methods implemented in the scTimeBench paper can be downloaded from Zenodo.
Start from a config file
The benchmark is started from the command line with a YAML file:
scTimeBench --config configs/scNODE/gex.yaml
The main entrypoint is implemented in src/scTimeBench/main.py. Configuration parsing and validation live in src/scTimeBench/config.py.
Select the method
The method section tells scTimeBench which runner to use and where to find the method-specific shell script. A minimal example looks like this:
method:
  name: scNODE
  train_and_test_script: ./methods/scNODE/train_and_test.sh
The method name must match the registered method class, and the script must point to the correct train_and_test.sh file under methods/.
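The contract of train_and_test.sh is method-specific and is not spelled out here. As an illustration only, a hypothetical script might take a processed-data directory and an output directory and write its predictions into the latter; the argument order and file names below are assumptions, not the actual scTimeBench interface:

```shell
#!/usr/bin/env bash
# Hypothetical train_and_test.sh sketch. Argument order and output file
# names are illustrative assumptions, not the real scTimeBench contract.
set -euo pipefail

DATA_DIR="${1:-./processed_data}"   # directory with preprocessed splits
OUT_DIR="${2:-./method_output}"     # directory for method outputs

mkdir -p "$OUT_DIR"
# A real script would train the method and write its predictions here;
# this placeholder only records that the step ran.
echo "trained on ${DATA_DIR}" > "${OUT_DIR}/predictions.txt"
```

A runner invoking this sketch would then look for the expected files under the output directory before starting metric evaluation.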
Select compatible metrics
The metrics list controls which evaluation families run for the chosen method. Each metric class declares the dataset class names it supports, so the config only needs to name the metric classes.
Example:
metrics:
  - name: GraphSimMetric
If multiple metrics are listed, they are evaluated in order. Some configs repeat a metric with different parameters, such as alternate trajectory inference models.
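When a metric is repeated, each entry carries its own parameters. The second metric class and its parameter below are hypothetical placeholders used only to show the shape, not actual scTimeBench identifiers:

```yaml
metrics:
  - name: GraphSimMetric
  - name: TrajectoryConservationMetric   # hypothetical metric class
    model: paga                          # hypothetical parameter selecting the TI model
  - name: TrajectoryConservationMetric
    model: slingshot
```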
Let the metric choose default datasets
If datasets are not listed explicitly in the config, scTimeBench uses the default datasets for the metric group. Those defaults are defined in src/scTimeBench/shared/dataset/default_datasets.yaml.
For example, the embedding and ontology-based metric groups use the default tags defined in that file, and then the framework resolves those tags to full dataset definitions before preprocessing starts.
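The actual layout of default_datasets.yaml is not reproduced here. Schematically, a mapping from metric groups to default tags might look like the following sketch, which reuses tags shown elsewhere on this page but whose group keys are assumptions:

```yaml
# Illustrative shape only; consult the real default_datasets.yaml for the
# actual keys and tag lists.
embedding:
  - defaultGarciaAlonso
ontology:
  - defaultGarciaAlonsoPseudotimeEvenCells
```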
Dataset overrides & customization
You can supply a datasets section to override the defaults. This is useful when you want a smaller subset, a different default tag, or a custom dataset entry.
Tag-based usage is the most common form:
datasets:
  - tag: defaultGarciaAlonso
  - tag: defaultGarciaAlonsoPseudotimeEvenCells
You can also provide a full dataset definition with a path and preprocessing steps when the dataset is not already in the shared registry.
Custom preprocessing
Dataset preprocessing is configured per dataset entry and is executed in order. Typical preprocessing steps include:
lineage filtering,
pseudotime inference,
timepoint rounding,
log-normalization, and
train/test splitting.
Example:
datasets:
  - name: GarciaAlonsoDataset
    data_path: ./data/garcia-alonso/human_germ.h5ad
    data_preprocessing_steps:
      - name: LineageDatasetFilter
        cell_lineage_file: ./cell_lineages/germ/cell_line.txt
        cell_equivalence_file: ./cell_lineages/germ/equal_names.txt
      - name: RoundCellsToTimepoint
        min_cells_per_timepoint: 10
      - name: LogNormPreprocessor
      - name: CopyTrainTest
The config parser accepts either a full dataset definition or a tag-only entry, but not a mix of tag and explicit dataset fields.
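The tag-versus-explicit rule can be sketched as a small validation check. This is a hypothetical helper written for illustration, not the actual code in src/scTimeBench/config.py:

```python
def validate_dataset_entry(entry: dict) -> None:
    """Reject dataset entries that mix a tag reference with explicit fields.

    Hypothetical sketch of the rule described above; the real parser in
    src/scTimeBench/config.py may differ in field names and error handling.
    """
    explicit_fields = {"name", "data_path", "data_preprocessing_steps"}
    has_tag = "tag" in entry
    has_explicit = bool(explicit_fields & entry.keys())
    if has_tag and has_explicit:
        raise ValueError("Use either a tag or explicit dataset fields, not both")
    if not has_tag and not has_explicit:
        raise ValueError("Provide a tag or a full dataset definition")
```

Under this sketch, `{"tag": "defaultGarciaAlonso"}` and a full definition both pass, while an entry combining a tag with a `data_path` is rejected.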
Understand run modes
The run_type option controls how much of the pipeline executes:
auto_train_test: train the method and then evaluate the metrics,
preprocess: only preprocess and prepare outputs,
eval_only: only evaluate previously generated outputs, and
train_only: train the method but skip metric evaluation.
The default run mode is auto_train_test.
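Assuming run_type is a top-level key of the config (an assumption for illustration; check your config schema), restricting a run to evaluation of previously generated outputs would look like:

```yaml
run_type: eval_only
method:
  name: scNODE
metrics:
  - name: GraphSimMetric
```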
Validate paths and outputs
The config loader checks that dataset paths, lineage files, and method scripts exist before the run starts. It also requires:
method.name,
metrics, and
train_and_test_script when auto_train_test is used.
The benchmark then creates per-method output directories, caches processed data, and verifies the required output files before metric evaluation.
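The pre-run checks amount to simple existence and required-field tests. The helper below is a hypothetical sketch of those checks, assuming the YAML has already been parsed into a dict; it is not the actual loader code:

```python
from pathlib import Path


def check_config(config: dict) -> list[str]:
    """Collect validation errors before a run starts.

    Illustrative sketch of the checks described above; the real loader in
    src/scTimeBench/config.py may differ.
    """
    errors = []
    if "name" not in config.get("method", {}):
        errors.append("method.name is required")
    if not config.get("metrics"):
        errors.append("metrics is required")
    # train_and_test_script is required (and must exist) for auto_train_test runs.
    if config.get("run_type", "auto_train_test") == "auto_train_test":
        script = config.get("method", {}).get("train_and_test_script")
        if not script or not Path(script).is_file():
            errors.append("method.train_and_test_script missing or not found")
    # Every explicit dataset path must point to an existing file.
    for ds in config.get("datasets", []):
        path = ds.get("data_path")
        if path and not Path(path).is_file():
            errors.append(f"dataset path not found: {path}")
    return errors
```

An empty returned list would mean the config passed these checks; any other result lists the problems found before training begins.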
Checklist
config file points to the desired method runner
metrics listed are compatible with that method's outputs
dataset tags match entries in the shared dataset registry
preprocessing steps are ordered correctly
paths exist for data files, lineage files, and method scripts