API Reference
Core Runtime
These modules control configuration parsing, persistence, and the top-level benchmark entrypoint.
config.py
Configuration management for YAML-based configs, similar to the tf-binding project. Handles both YAML file loading and command-line argument parsing.
- class scTimeBench.config.Config
Bases: object
Config class for both YAML and CLI arguments.
- class scTimeBench.config.CsvExportType(value)
Bases: Enum
An enumeration.
- EMBEDDING = 'embedding'
- GEX_PRED = 'gex_pred'
- GRAPH_SIM = 'graph_sim'
- class scTimeBench.config.CsvWriteMode(value)
Bases: Enum
An enumeration.
- MERGE = 'merge'
- SEPARATE = 'separate'
- class scTimeBench.config.RunType(value)
Bases: Enum
An enumeration.
- AUTO_TRAIN_TEST = 'auto_train_test'
- EVAL_ONLY = 'eval_only'
- PREPROCESS = 'preprocess'
- TRAIN_ONLY = 'train_only'
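For orientation, here is a minimal sketch of how such config enums are typically resolved from a CLI or YAML string. The enum is re-declared here purely for illustration (mirroring the values listed above), and `parse_run_type` is a hypothetical helper, not part of the package:

```python
from enum import Enum

# Re-declared for illustration; mirrors scTimeBench.config.RunType.
class RunType(Enum):
    AUTO_TRAIN_TEST = "auto_train_test"
    EVAL_ONLY = "eval_only"
    PREPROCESS = "preprocess"
    TRAIN_ONLY = "train_only"

def parse_run_type(value: str) -> RunType:
    """Resolve a CLI/YAML string into a RunType, with a readable error."""
    try:
        return RunType(value)
    except ValueError:
        valid = ", ".join(m.value for m in RunType)
        raise ValueError(f"unknown run_type {value!r}; expected one of: {valid}")
```

Because `Enum` lookups by value raise a bare `ValueError`, wrapping the lookup like this gives config files a friendlier failure mode.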
database.py
Database manager using sqlite3.
This module provides a simple interface to interact with an SQLite database, including the setup of tables for storing:
1. Paths to processed datasets.
2. Paths to method checkpoints.
3. Paths to method predictions.
4. Metric results.
- class scTimeBench.database.DatabaseManager(config: Config)
Bases: object
- clear_tables()
- close()
- embedding_to_csv(output_csv_path, append=False)
- get_dataset_id(method: MethodManager)
- get_dataset_tag_from_id(dataset_id)
- get_evals_per_method(method: MethodManager)
- get_evals_per_metric(metric_name: str, metric_params: str)
- get_method_output_path(method: MethodManager)
- gex_pred_to_csv(output_csv_path, append=False)
- graph_sim_to_csv(output_csv_path, append=False)
- has_eval(method: MethodManager, metric_name: str, metric_params: str) → bool
- has_metric(name: str, parameters: str) → bool
- insert_dataset(dataset: BaseDataset)
- insert_dataset_metric(dataset: BaseDataset, metric_name, metric_params, result)
- insert_eval(method: MethodManager, metric_name: str, metric_params: str, result)
- insert_method_output(method: MethodManager, output_path: str)
- insert_metric(name: str, parameters: str)
- print_all()
- return_all()
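To illustrate the pattern behind this interface, here is a stripped-down, hypothetical stand-in that mirrors two of the calls above (`insert_metric`, `has_metric`) over an in-memory SQLite database. The schema is guessed from the module description, not copied from the package:

```python
import sqlite3

# Illustrative sketch only: a stand-in for DatabaseManager with a schema
# guessed from the module description (a metrics table plus eval results).
class MiniDatabaseManager:
    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.executescript("""
            CREATE TABLE IF NOT EXISTS metrics (
                id INTEGER PRIMARY KEY,
                name TEXT, parameters TEXT,
                UNIQUE(name, parameters)
            );
            CREATE TABLE IF NOT EXISTS evals (
                metric_id INTEGER REFERENCES metrics(id),
                result REAL
            );
        """)

    def insert_metric(self, name, parameters):
        # UNIQUE(name, parameters) + OR IGNORE makes inserts idempotent.
        self.conn.execute(
            "INSERT OR IGNORE INTO metrics (name, parameters) VALUES (?, ?)",
            (name, parameters),
        )

    def has_metric(self, name, parameters):
        row = self.conn.execute(
            "SELECT 1 FROM metrics WHERE name = ? AND parameters = ?",
            (name, parameters),
        ).fetchone()
        return row is not None

    def close(self):
        self.conn.close()
```

Keying evals by a (name, parameters) metric row is what lets `has_eval`-style checks skip already-computed results on re-runs.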
main.py
Entrypoint for measuring trajectories in single-cell data, particularly involving gene regulatory networks and cell lineage information.
- scTimeBench.main.main()
Main entrypoint for the scTimeBench (crispy-fishstick) package.
Dataset Infrastructure
These modules define dataset loading, preprocessing, shared constants, and utility helpers used throughout the benchmark.
Bases: Enum
An enumeration.
Bases: Enum
An enumeration.
Bases: object
Create a directory for this dataset configuration under the given base path.
Generate a string representation of the dataset configuration.
This can be used to cache processed datasets.
Generate a string representation of the applied dataset preprocessors and their parameters.
This can be used to cache processed datasets.
We define a checkpoint as the ith preprocessor in the pipeline. This is used to save intermediate results that take a while to get to (such as pseudotime estimation).
Get a unique directory name for this dataset configuration, which can be used for caching. This is based on the dataset name, the encoded dataset dictionary, and the encoded preprocessors.
It should be a hashable string that uniquely identifies the dataset configuration and applied preprocessors, so that we can cache processed datasets effectively.
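One common way to build such a cache key is to hash a canonical JSON encoding of the configuration; a sketch under that assumption (the function name and dict layout are hypothetical):

```python
import hashlib
import json

def dataset_dir_name(dataset_name, dataset_cfg, preprocessors):
    """Hypothetical sketch: build a unique, filesystem-safe cache key from
    the dataset name, its config dict, and the preprocessor pipeline."""
    payload = json.dumps(
        {"dataset": dataset_cfg, "preprocessors": preprocessors},
        sort_keys=True,  # stable key order so equal configs hash equally
    )
    digest = hashlib.sha256(payload.encode()).hexdigest()[:12]
    return f"{dataset_name}_{digest}"
```

Sorting the keys before hashing is the important detail: two configs that differ only in dict insertion order still map to the same cache directory.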
Get the name of the dataset from the configuration.
This ensures that the dataset loading is done properly.
We require the following: 1. Load the data from the source. 2. Include observation metadata of cell_type, and timepoint. 3. Drop everything else not required, to speed up processing. 4. Apply the dataset preprocessors provided. 5. Return the train and test splits.
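The five requirements above can be sketched with a plain pandas DataFrame of observation metadata standing in for a real AnnData object (function name, column handling, and split logic are illustrative only):

```python
import pandas as pd

REQUIRED_OBS = ["cell_type", "timepoint"]

def load_and_split(obs: pd.DataFrame, preprocessors, test_frac=0.2):
    """Illustrative stand-in for the loading contract: a DataFrame of
    observation metadata plays the role of AnnData .obs here."""
    # 2. Require cell_type and timepoint metadata.
    missing = [c for c in REQUIRED_OBS if c not in obs.columns]
    if missing:
        raise ValueError(f"missing obs columns: {missing}")
    # 3. Drop everything else not required, to speed up processing.
    obs = obs[REQUIRED_OBS].copy()
    # 4. Apply the dataset preprocessors in order.
    for pp in preprocessors:
        obs = pp(obs)
    # 5. Return train/test splits (here: a simple tail split by position).
    n_test = max(1, int(len(obs) * test_frac))
    return obs.iloc[:-n_test], obs.iloc[-n_test:]
```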
Update:
> Because I'm getting annoyed about the dependency hell we need for psupertime…
> I've decided that the best way forward is to simply add pypsupertime as a possible
> thing to have, but not necessary. Instead, we would require them to run the preprocessing
> ahead of time, which is what this function does: it loads the data (running it through the
> preprocessors) and saves it to the respective output directory.
Some datasets might require caching because they have preprocessors that take a long time to run (e.g., pseudotime estimation). By default, we assume that datasets do not require caching, but this can be overridden by specific datasets if necessary.
Bases: object
Subclasses should implement this method to preprocess and split the dataset according to the metric's requirements.
By default, most preprocessors should be simple and not require external packages.
Decorator to register a dataset class in the DATASET_REGISTRY.
Decorator to register a dataset preprocessor class in the DATASET_PREPROCESSOR_REGISTRY.
Clear the in-memory dataset cache.
Get the dataset from the pickled dataset file in output_path.
- Args:
output_path: Path to the method output directory
- Returns:
The dataset object loaded from the pickled file
Heuristic to determine whether the data is log-normalized to a given counts threshold. Checks if ann_data.X is raw and, if not, checks that the data is log-normalized to counts=10_000.
- Args:
ann_data: The AnnData object to check
counts: The expected counts value (default is 10_000)
- Returns:
True if the data is log-normalized to the expected counts, False otherwise
Returns whether the data is raw (i.e. not log-normalized) by checking that: 1. All the data is non-negative 2. All the data is integer-valued
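Both heuristics are straightforward to sketch with NumPy. This is a dense-matrix approximation of the checks described above; the real implementation may also handle sparse matrices:

```python
import numpy as np

def is_raw(X) -> bool:
    """Sketch of the heuristic above: raw counts are non-negative integers."""
    X = np.asarray(X)
    return bool((X >= 0).all() and np.allclose(X, np.round(X)))

def is_lognorm(X, counts=10_000) -> bool:
    """Sketch: after undoing log1p, each cell should sum to ~`counts`."""
    X = np.asarray(X, dtype=float)
    if is_raw(X):
        return False
    row_sums = np.expm1(X).sum(axis=1)  # invert log1p, then sum per cell
    return bool(np.allclose(row_sums, counts, rtol=1e-2))
```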
Load a method output file from output_path.
- Args:
output_path: Path to the method output directory
required_output: RequiredOutputFiles enum value specifying which file to load
- Returns:
For .npy files: a numpy array. For .parquet files: a pandas DataFrame.
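The suffix dispatch described above can be sketched as follows (hypothetical helper; the real function resolves the file via a RequiredOutputFiles enum rather than taking a bare path):

```python
from pathlib import Path

import numpy as np
import pandas as pd

def load_output_file(path):
    """Hypothetical loader mirroring the description above: dispatch on the
    file suffix, returning an ndarray for .npy and a DataFrame for .parquet."""
    path = Path(path)
    if path.suffix == ".npy":
        return np.load(path)
    if path.suffix == ".parquet":
        return pd.read_parquet(path)
    raise ValueError(f"unsupported output file type: {path.suffix}")
```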
Load the test dataset from the pickled dataset file in output_path.
- Args:
output_path: Path to the method output directory
- Returns:
The test AnnData object from the dataset
Method Execution
These modules provide the method runner interface and the helper used by the benchmark to launch methods and collect their outputs.
Note: this file alone is used by other methods as a base class, so its context is outside the src/ folder and we need to use scTimeBench.* imports instead of relative imports.
- class scTimeBench.method_utils.method_runner.BaseMethod(yaml_config)
Bases: object
- generate(test_ann_data)
Main generation method that dispatches to the individual output generators. Each output is saved to its own file under self.output_path.
- generate_embedding(test_ann_data) → ndarray
Generate embeddings for the current timepoint.
Returns: np.ndarray of shape (n_cells, embedding_dim)
- generate_next_cell_type(test_ann_data) → DataFrame
Generate next cell type predictions.
Returns: pd.DataFrame with cell type predictions
- generate_next_tp_embedding(test_ann_data) → ndarray
Generate embeddings for the next timepoint.
Returns: np.ndarray of shape (n_cells, embedding_dim)
- generate_next_tp_gex(test_ann_data) → ndarray
Generate gene expression for the next timepoint.
Returns: np.ndarray of shape (n_cells, n_genes)
- generate_pred_graph(test_ann_data) → ndarray
Generate the predicted graph.
Returns: np.ndarray representing the predicted graph
- generate_zero_to_end_pred_gex(first_tp_cells, all_tps) → AnnData
Generate predicted gene expression from the first to the last timepoint.
Returns: AnnData object with predicted gene expression across all timepoints
- train(ann_data, all_tps=None)
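The dispatch pattern behind generate() can be sketched like this (the class, file names, and generator set are hypothetical; the real BaseMethod covers all of the outputs listed above):

```python
from pathlib import Path

import numpy as np

# Hypothetical sketch of the dispatch pattern: generate() walks a table of
# output file names -> generator methods and saves each result separately.
class TinyMethod:
    def __init__(self, output_path):
        self.output_path = Path(output_path)
        self.output_path.mkdir(parents=True, exist_ok=True)

    def generate_embedding(self, test_data):
        return np.zeros((len(test_data), 2))  # (n_cells, embedding_dim)

    def generate_next_tp_gex(self, test_data):
        return np.asarray(test_data, dtype=float)  # (n_cells, n_genes)

    def generate(self, test_data):
        generators = {
            "embedding.npy": self.generate_embedding,
            "next_tp_gex.npy": self.generate_next_tp_gex,
        }
        for fname, fn in generators.items():
            np.save(self.output_path / fname, fn(test_data))
```

Saving each output to its own file keeps the metrics decoupled: a metric only has to load the one file it evaluates.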
- scTimeBench.method_utils.method_runner.get_parser()
- scTimeBench.method_utils.method_runner.main(method_class: BaseMethod)
- scTimeBench.method_utils.method_runner.process_yaml(yaml_path)
- class scTimeBench.method_utils.ot_method_runner.BaseOTMethod(yaml_config)
Bases: BaseMethod
Base class for OT-based methods.
- generate_embedding(test_ann_data) → ndarray
Generate PCA embeddings from gene expression data.
- generate_next_cell_type(test_ann_data) → DataFrame
Generate next cell type predictions using the transport plan.
- generate_next_tp_embedding(test_ann_data) → ndarray
Generate embeddings for the next timepoint using the transport plan.
- generate_next_tp_gex(test_ann_data) → ndarray
Generate gene expression for the next timepoint using the transport plan.
- get_transport_plan(source_data, target_data)
Given source and target data, compute the transport plan. Subclasses representing OT methods should implement this method.
Parameters:
- source_data : np.ndarray
Source data matrix (cells x features)
- target_data : np.ndarray
Target data matrix (cells x features)
Returns:
- np.ndarray
Transport plan matrix (source cells x target cells)
- train(ann_data, all_tps=None)
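To make the plan's shape and role concrete, here is a naive stand-in with the same interface. This is NOT real optimal transport (no marginal constraints are solved); it just assigns each source cell soft weights over target cells by distance, which is enough to show how downstream code consumes a plan:

```python
import numpy as np

def naive_transport_plan(source, target, temperature=1.0):
    """Naive stand-in for get_transport_plan (not a real OT solver):
    softmax over negative squared distances, row-normalized."""
    d = ((source[:, None, :] - target[None, :, :]) ** 2).sum(-1)
    logits = -d / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    plan = w / w.sum(axis=1, keepdims=True)      # each row sums to 1
    return plan  # shape (n_source, n_target)
```

A subclass's get_transport_plan would replace this with an actual OT solver; either way, the plan can then push labels or expression forward, e.g. `plan @ target_values`.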
Metric Framework
These modules define the metric base class and the method manager used to bind datasets to method outputs during evaluation.
Base class for all metrics. They should all implement the eval method, and depend on the dataset that they belong to.
- class scTimeBench.metrics.base.BaseMetric(config: Config, db_manager: DatabaseManager, metric_config: dict)
Bases: object
- final eval()
Evaluation function that handles the calling of submetrics, if applicable.
It proceeds as follows:
1. If there are submetrics defined, we create an instance of each submetric.
2. We call the _eval function of each submetric.
3. From this _eval function, we further call the _submetric_eval function that each subclass must implement.
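This control flow reads like a textbook template method; a hypothetical miniature (method names mirror the description above, the classes and values are invented):

```python
# Illustrative sketch of the template-method flow: eval() fans out to
# submetrics, and each submetric's _eval() delegates to the subclass
# hook _submetric_eval().
class SketchMetric:
    submetrics = []  # subclasses may list submetric classes here

    def eval(self):
        if self.submetrics:
            return {sm.__name__: sm()._eval() for sm in self.submetrics}
        return self._eval()

    def _eval(self):
        return self._submetric_eval()

    def _submetric_eval(self):
        raise NotImplementedError

class MeanError(SketchMetric):
    def _submetric_eval(self):
        return 0.25  # invented value for illustration

class Combined(SketchMetric):
    submetrics = [MeanError]
```

Marking eval() as final keeps the fan-out logic in one place while subclasses only supply _submetric_eval.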
- scTimeBench.metrics.base.register_metric(cls)
Decorator to register a metric class in the METRIC_REGISTRY.
- scTimeBench.metrics.base.skip_metric(cls)
Decorator to register a skip-metric class in the SKIP_METRIC_REGISTRY.
Method Base Class.
- class scTimeBench.metrics.method_manager.MethodManager(config, dataset: BaseDataset)
Bases: object
- train_and_test(yaml_config_path)
Runs the train and test script provided in the config.
Trajectory Inference
These modules implement the trajectory inference abstractions and concrete inference strategies used by the metrics.
Base trajectory inference model.
This is the base class for all trajectory inference models, i.e. given an ann data and its timepoints, we want to infer the trajectory structure.
Examples are the kNN graph-based methods, or the optimal transport based methods.
- class scTimeBench.trajectory_infer.base.BaseTrajectoryInferMethod(traj_config)
Bases: object
- encode()
Hash the trajectory inference method based on its class name and parameters.
- encode_for_classifier()
Hash the trajectory inference method for the classifier based on its class name and parameters.
This is different from the regular encode because we ignore the from_tp_zero setting: the classifier should be shared regardless of that setting.
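A sketch of that idea (the `from_tp_zero` key comes from the description above; everything else is hypothetical): drop the key from the parameter dict before hashing, so both settings map to the same classifier cache entry:

```python
import hashlib
import json

def encode_for_classifier(params: dict) -> str:
    """Sketch: hash the method parameters while dropping from_tp_zero,
    so classifiers trained under either setting share one cache key."""
    shared = {k: v for k, v in params.items() if k != "from_tp_zero"}
    return hashlib.sha256(
        json.dumps(shared, sort_keys=True).encode()
    ).hexdigest()[:12]
```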
- final infer_trajectory(output_path, per_tp=False)
Infer the trajectory using the kNN graph-based method:
1. Separate each embedding by time.
2. Find the k nearest neighbors in the next time point embedding space.
3. Consolidate the cell types per time point based on the kNN results.
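Steps 2 and 3 can be sketched with brute-force NumPy distances (a hypothetical helper; the real implementation builds a proper kNN graph):

```python
from collections import Counter

import numpy as np

def knn_next_tp_labels(cur_emb, next_emb, next_labels, k=3):
    """Sketch: for each cell at the current timepoint, find its k nearest
    neighbors in the next timepoint's embedding space and take a majority
    vote over their cell types."""
    # Pairwise squared distances, shape (n_cur, n_next).
    d = ((cur_emb[:, None, :] - next_emb[None, :, :]) ** 2).sum(-1)
    knn_idx = np.argsort(d, axis=1)[:, :k]
    votes = []
    for row in knn_idx:
        votes.append(Counter(next_labels[i] for i in row).most_common(1)[0][0])
    return votes
```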
- predict_next_tp(output_path, test_ann_data=None, traj_infer_path=None)
Predict the next timepoint cell types using the trajectory inference model.
- supports_gex()
Function to be overridden if the trajectory inference method can support gene expression data. By default, we assume it does not.
- final train_and_predict(output_path, train_only=False)
Trains and predicts using the trajectory inference model.
- train_and_predict_k_fold_cv(output_path, k)
Does the train and predict with k-fold cross-validation.
We store everything under traj_infer_path/k_fold_<k>/fold_<i>/
- uses_gene_expr()
- class scTimeBench.trajectory_infer.base.TrajectoryInferenceMethodFactory
Bases: object
- get_trajectory_infer_method(traj_config) → BaseTrajectoryInferMethod
- scTimeBench.trajectory_infer.base.register_trajectory_inference_method(cls)
Decorator to register a trajectory inference method.
Classifier implementation for trajectory inference.
- class scTimeBench.trajectory_infer.classifier.CellTypist(traj_config)
Bases: BaseTrajectoryInferMethod
- class scTimeBench.trajectory_infer.classifier.Classifier(traj_config)
Bases: BaseTrajectoryInferMethod
- class scTimeBench.trajectory_infer.classifier.ClassifierTypes(value)
Bases: Enum
An enumeration.
- BOOSTING = 'boosting'
- RANDOM_FOREST = 'random_forest'
kNN implementation for trajectory inference.
- class scTimeBench.trajectory_infer.kNN.kNN(traj_config)
Bases: BaseTrajectoryInferMethod
- get_kNN_graph(output_path)
Function to get the kNN graph used in the trajectory inference.
This can be useful for visualization or further analysis.
- class scTimeBench.trajectory_infer.kNN.kNNStrategy(value)
Bases: Enum
An enumeration.
- MAJORITY_VOTE = 'majority_vote'
- WEIGHTED_AVERAGE = 'weighted_average'
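The two strategies can be sketched over a single neighbor set (hypothetical helper; the weights would typically be inverse neighbor distances):

```python
def aggregate_labels(neighbor_labels, weights, strategy="majority_vote"):
    """Sketch of the two kNN strategies above: pick a hard label by raw
    vote count, or weight each neighbor's vote (e.g. by inverse distance)."""
    counts = {lab: 0.0 for lab in neighbor_labels}
    for lab, w in zip(neighbor_labels, weights):
        counts[lab] += 1.0 if strategy == "majority_vote" else w
    return max(counts, key=counts.get)
```

The two strategies can disagree: a minority label whose neighbors are much closer can win under weighted averaging.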
OT implementation for trajectory inference.
- class scTimeBench.trajectory_infer.ot.OptimalTransport(traj_config)
Bases: BaseTrajectoryInferMethod
WARNING: This class is untested and deprecated.
Please switch to either kNN or Classifier with scikit-learn based classifiers for better performance and maintainability.
- cell_types_to_one_hot(cell_types)
Given a list of cell types, convert it to a one-hot encoding.
- get_ot_labels(true_embed, pred_embed, one_hot_labels)
Given the true embeddings, predicted embeddings, and one-hot encoding of the true cell types, get the transport plan using optimal transport.
- soft_labels_to_cell_types(labels, index_to_type)
Given the labels from get_ot_labels and the index-to-type mapping, convert the soft labels to hard cell type labels.
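The encode/decode pair around the transport step can be sketched as a round trip (function bodies are illustrative, not the package's implementation):

```python
import numpy as np

def cell_types_to_one_hot(cell_types):
    """Sketch: map cell-type strings to one-hot rows plus an
    index -> type mapping for decoding later."""
    types = sorted(set(cell_types))
    index_to_type = dict(enumerate(types))
    type_to_index = {t: i for i, t in index_to_type.items()}
    one_hot = np.zeros((len(cell_types), len(types)))
    for row, ct in enumerate(cell_types):
        one_hot[row, type_to_index[ct]] = 1.0
    return one_hot, index_to_type

def soft_labels_to_cell_types(soft_labels, index_to_type):
    """Sketch: hard-assign each soft-label row to its highest-scoring type."""
    return [index_to_type[i] for i in np.asarray(soft_labels).argmax(axis=1)]
```

In between the two, a transport plan mixes the one-hot rows into soft labels; the argmax decode then recovers hard predictions.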
- supports_gex()
By default OT does not have enough capacity to support gene expression data, as it is primarily designed for embedding-based trajectory inference. This is because OT can be computationally intensive and may not scale well with high-dimensional gene expression data, leading to longer runtimes and potential memory issues.