Trajectory Generation Pipeline¶
This document describes the synthetic cancer progression trajectory generation pipeline in renalprog.
Overview¶
The trajectory generation pipeline creates synthetic cancer progression paths by interpolating between early-stage and late-stage cancer samples in the VAE's latent space. This allows us to model intermediate states of cancer progression that may not be directly observable in the data.
Biological Motivation¶
Cancer progression is a continuous process, but clinical data typically captures discrete snapshots (stages I-IV). Synthetic trajectories help us:
- Model Intermediate States: Generate hypothetical intermediate stages between observed samples
- Identify Progression Markers: Find genes and pathways that change during progression
- Understand Dynamics: Study the temporal ordering of molecular events
- Generate Hypotheses: Predict which pathways are activated during progression
Pipeline Steps¶
1. Patient Connection Creation¶
Identify pairs of patients to connect:
from renalprog.modeling.trajectories import create_patient_connections
connections = create_patient_connections(
clinical_df=clinical_data,
source_stage="early", # Stage I-II
target_stage="late", # Stage III-IV
method="all_to_all" # Connect every early to every late
)
# Each connection: (source_patient_id, target_patient_id)
Connection strategies:
- all_to_all: Every early patient → every late patient
- nearest_neighbor: Early patients → nearest late patients in latent space (see the sketch below)
- matched: Match on clinical covariates (age, sex, etc.)
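For intuition, here is a minimal sketch of how a nearest_neighbor strategy could pair patients in latent space, assuming the early- and late-stage samples have already been encoded to latent matrices z_early and z_late. The helper name and inputs are illustrative, not the library's internal implementation:
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nearest_neighbor_connections(z_early, z_late):
    """Pair each early-stage sample with its closest late-stage sample."""
    nn = NearestNeighbors(n_neighbors=1).fit(z_late)
    _, idx = nn.kneighbors(z_early)  # index of the closest late-stage sample
    return [(i, int(j)) for i, j in enumerate(idx[:, 0])]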
2. Latent Space Interpolation¶
Generate intermediate points in VAE latent space:
from renalprog.modeling.predict import generate_trajectories
trajectories = generate_trajectories(
model=vae_model,
source_samples=early_stage_data,
target_samples=late_stage_data,
n_steps=50, # Number of intermediate timepoints
method="linear" # Interpolation method
)
# Output shape: (n_trajectories, n_steps, latent_dim)
# (latent-space paths; step 3 decodes them to gene expression)
Interpolation methods:
- linear: Straight line in latent space (default)
- spherical: Interpolate along a geodesic on the hypersphere (see the sketch below)
- bezier: Smooth curve using Bezier interpolation
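To make the first two methods concrete, here is a standalone NumPy sketch of linear and spherical (slerp) interpolation between two latent vectors. It illustrates the idea rather than reproducing the generate_trajectories internals:
import numpy as np

def linear_interp(z_start, z_end, n_steps):
    # Evenly spaced points on the straight line between the two latent vectors
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1.0 - t) * z_start + t * z_end

def slerp_interp(z_start, z_end, n_steps, eps=1e-8):
    # Spherical interpolation: follow the arc between the two vectors (White, 2016);
    # the eps guard avoids division by zero for near-parallel vectors
    t = np.linspace(0.0, 1.0, n_steps)
    cos_omega = np.dot(z_start, z_end) / (
        np.linalg.norm(z_start) * np.linalg.norm(z_end) + eps)
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    so = np.sin(omega) + eps
    return np.stack([
        (np.sin((1 - ti) * omega) / so) * z_start + (np.sin(ti * omega) / so) * z_end
        for ti in t
    ])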
3. Gene Expression Reconstruction¶
Decode latent trajectories to gene expression space:
from renalprog.modeling.predict import reconstruct_from_latent
import numpy as np

# For each timepoint, decode latent vectors to gene expression
n_steps = trajectories.shape[1]  # number of interpolation timepoints
gene_expression = []
for t in range(n_steps):
    z_t = trajectories[:, t, :]  # Latent vectors at time t
    x_t = vae_model.decode(z_t)  # Reconstruct gene expression
    gene_expression.append(x_t)
# Shape: (n_trajectories, n_steps, n_genes)
gene_expression = np.stack(gene_expression, axis=1)
4. Quality Control¶
Validate trajectory quality:
from renalprog.modeling.trajectories import validate_trajectories
quality_metrics = validate_trajectories(
trajectories=gene_expression,
early_samples=X_early,
late_samples=X_late,
vae_model=vae_model
)
# Metrics:
# - reconstruction_error: How well VAE reconstructs samples
# - monotonicity: Do progression markers increase/decrease consistently?
# - smoothness: Are trajectories smooth or erratic?
# - stage_separation: Do endpoints match early/late distributions?
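For intuition, here is a minimal sketch of how smoothness and monotonicity could be computed for a single gene-expression trajectory; the actual validate_trajectories implementation may differ:
import numpy as np
from scipy.stats import spearmanr

def smoothness(traj):
    """Mean step-to-step distance; lower values indicate a smoother path."""
    steps = np.diff(traj, axis=0)  # (n_steps - 1, n_genes)
    return float(np.linalg.norm(steps, axis=1).mean())

def monotonicity(traj, marker_idx):
    """Spearman correlation of a marker gene with pseudotime (near +1/-1 = consistent change)."""
    timepoints = np.arange(traj.shape[0])
    rho, _ = spearmanr(timepoints, traj[:, marker_idx])
    return float(rho)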
5. Control Trajectory Generation¶
Generate negative controls for comparison:
from renalprog.modeling.trajectories import generate_control_trajectories
# Random noise trajectories
noise_controls = generate_control_trajectories(
n_trajectories=len(trajectories),
n_steps=n_steps,
n_genes=n_genes,
method="gaussian_noise",
noise_scale=1.0
)
# Shuffled gene trajectories
shuffled_controls = generate_control_trajectories(
trajectories=trajectories,
method="shuffle_genes"
)
Control types:
- gaussian_noise: Random Gaussian noise
- shuffle_genes: Shuffle gene order independently
- shuffle_time: Randomize temporal ordering (see the sketch below)
- reverse: Reverse early→late to late→early
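As a rough illustration, the shuffle_time and reverse controls can be expressed directly in NumPy, assuming trajectories has shape (n_trajectories, n_steps, n_genes). The library call above wraps logic of this kind, and its exact behavior may differ:
import numpy as np

rng = np.random.default_rng(42)

# shuffle_time: independently randomize the temporal ordering of each trajectory
shuffle_time_controls = np.stack(
    [traj[rng.permutation(traj.shape[0])] for traj in trajectories], axis=0)

# reverse: flip early→late trajectories into late→early
reverse_controls = trajectories[:, ::-1, :]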
Output Files¶
The trajectory pipeline generates:
data/interim/trajectories/
├── early_to_late/
│ ├── trajectory_metadata.csv # Patient pairs, stages
│ ├── latent_trajectories.npy # Latent space paths
│ ├── gene_expression.npy # Reconstructed expression
│ └── quality_metrics.json # Validation metrics
├── controls/
│ ├── noise_controls.npy # Random noise
│ └── shuffled_controls.npy # Shuffled genes
└── visualizations/
├── latent_space_trajectories.png
├── sample_trajectories.png
└── progression_heatmap.png
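Assuming the layout above, the outputs can be loaded back for downstream analysis like this (adjust the base path to your run's dated directory):
import json
import numpy as np
import pandas as pd

base = "data/interim/trajectories/early_to_late"
metadata = pd.read_csv(f"{base}/trajectory_metadata.csv")    # patient pairs, stages
latent_trajs = np.load(f"{base}/latent_trajectories.npy")    # latent-space paths
expression = np.load(f"{base}/gene_expression.npy")          # reconstructed expression
with open(f"{base}/quality_metrics.json") as fh:
    quality_metrics = json.load(fh)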
Usage Example¶
Complete Trajectory Workflow¶
from renalprog.modeling.predict import generate_trajectories
from renalprog.modeling.trajectories import (
create_patient_connections,
validate_trajectories,
generate_control_trajectories
)
from renalprog.config import get_dated_dir, INTERIM_DATA_DIR
import numpy as np
import pandas as pd
# 1. Load trained VAE
from renalprog.modeling.train import load_vae
vae = load_vae("models/20251216_VAE_KIRC/best_model.pth")
# 2. Load data
X_early = np.load("data/interim/samples/early_stage.npy")
X_late = np.load("data/interim/samples/late_stage.npy")
clinical = pd.read_csv("data/interim/samples/clinical.csv")  # clinical table with stage labels (illustrative path)
# 3. Create connections
connections = create_patient_connections(
clinical_df=clinical,
source_stage="early",
target_stage="late",
method="all_to_all"
)
# 4. Generate trajectories
trajectories = generate_trajectories(
model=vae,
source_samples=X_early,
target_samples=X_late,
n_steps=50,
method="linear"
)
# 5. Validate quality
metrics = validate_trajectories(
trajectories=trajectories,
early_samples=X_early,
late_samples=X_late,
vae_model=vae
)
# 6. Generate controls
controls = generate_control_trajectories(
n_trajectories=len(trajectories),
n_steps=50,
n_genes=trajectories.shape[-1],
method="gaussian_noise"
)
# 7. Save results
output_dir = get_dated_dir(INTERIM_DATA_DIR, "synthetic_trajectories")
np.save(f"{output_dir}/trajectories.npy", trajectories)
np.save(f"{output_dir}/controls.npy", controls)
print(f"Generated {len(trajectories)} trajectories")
print(f"Quality metrics: {metrics}")
Script Usage¶
# Run trajectory generation pipeline
python scripts/pipeline_steps/4_trajectories.py \
--vae_model models/20251216_VAE_KIRC/best_model.pth \
--data_dir data/interim/20251216_train_test_split \
--output_dir data/interim/20251216_synthetic_trajectories \
--n_steps 50 \
--method linear \
--generate_controls
Visualization¶
Latent Space Trajectories¶
Visualize trajectories in 2D latent space (using UMAP/t-SNE):
from renalprog.plots import plot_latent_trajectories
fig = plot_latent_trajectories(
latent_trajectories=z_trajectories,
early_latent=z_early,
late_latent=z_late,
reduction_method="umap"
)
fig.savefig("reports/figures/latent_trajectories.png")
Gene Expression Heatmaps¶
Show how genes change along trajectories:
from renalprog.plots import plot_trajectory_heatmap
fig = plot_trajectory_heatmap(
trajectories=trajectories[:10], # First 10 trajectories
gene_names=important_genes,
n_genes_show=50
)
fig.savefig("reports/figures/trajectory_heatmap.png")
Individual Trajectory Plots¶
Plot selected genes over time:
from renalprog.plots import plot_individual_trajectory
genes_of_interest = ["VHL", "PBRM1", "SETD2", "BAP1"]
fig = plot_individual_trajectory(
trajectory=trajectories[0],
gene_names=all_genes,
highlight_genes=genes_of_interest
)
fig.savefig("reports/figures/individual_trajectory.png")
Advanced Topics¶
Custom Interpolation Methods¶
Implement custom interpolation:
from renalprog.modeling.trajectories import BaseInterpolator
import numpy as np
from scipy.interpolate import CubicSpline

class CustomInterpolator(BaseInterpolator):
    def interpolate(self, z_start, z_end, n_steps):
        """Custom interpolation between latent vectors."""
        # Example: cubic spline through the endpoints.
        # Note: with only the two endpoints as control points this is essentially
        # a straight line; add intermediate waypoints for a genuinely curved path.
        t = np.linspace(0, 1, n_steps)
        cs = CubicSpline([0, 1], np.vstack([z_start, z_end]))
        return cs(t)
# Use custom interpolator
trajectories = generate_trajectories(
model=vae,
source_samples=X_early,
target_samples=X_late,
n_steps=50,
interpolator=CustomInterpolator()
)
Trajectory Clustering¶
Identify different progression patterns:
from sklearn.cluster import KMeans
from renalprog.modeling.trajectories import extract_trajectory_features
# Extract features from trajectories
features = extract_trajectory_features(
trajectories,
method="slope" # Gene expression slopes
)
# Cluster trajectories
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(features)
# Analyze each cluster
for i in range(3):
cluster_trajs = trajectories[clusters == i]
print(f"Cluster {i}: {len(cluster_trajs)} trajectories")
Multi-Stage Trajectories¶
Generate trajectories across multiple stages:
# Generate trajectories: Stage I → Stage II → Stage III → Stage IV
stages = ["stage_i", "stage_ii", "stage_iii", "stage_iv"]
multi_stage_trajs = []
for i in range(len(stages) - 1):
trajs = generate_trajectories(
model=vae,
source_samples=samples[stages[i]],
target_samples=samples[stages[i+1]],
n_steps=20,
method="linear"
)
multi_stage_trajs.append(trajs)
# Concatenate all segments along the time axis
# (assumes each stage transition yields the same number of trajectories)
full_trajectories = np.concatenate(multi_stage_trajs, axis=1)
Interpretation Guidelines¶
What Do Trajectories Represent?¶
- Not Real Patients: Trajectories are hypothetical, model-generated paths
- Biological Plausibility: Validated by enrichment analysis and clinical markers
- Statistical Averages: Represent general trends, not individual variation
- Hypothesis Generation: Use to identify candidates for experimental validation
Common Pitfalls¶
- Over-interpretation: Don't assume every trajectory detail is biologically meaningful
- Model Artifacts: VAE may introduce interpolation artifacts
- Missing Biology: Model can't capture unknown mechanisms
- Data Bias: Trajectories reflect training data distributions
Best Practices¶
- Generate Controls: Always compare to random/shuffled controls
- Validate Endpoints: Ensure trajectories start/end at correct stages
- Check Monotonicity: Progression markers should change consistently
- Cross-validate: Compare to independent validation datasets
- Biological Validation: Prioritize findings that match known biology
Performance Considerations¶
Memory Usage¶
For large datasets:
# Generate trajectories in batches to limit memory use
batch_size = 100
connections_arr = np.asarray(connections)  # (n_pairs, 2); assumes integer row indices into X_early / X_late
all_trajectories = []
for i in range(0, len(connections_arr), batch_size):
    batch_connections = connections_arr[i:i + batch_size]
    batch_trajs = generate_trajectories(
        model=vae,
        source_samples=X_early[batch_connections[:, 0]],
        target_samples=X_late[batch_connections[:, 1]],
        n_steps=50
    )
    all_trajectories.append(batch_trajs)
trajectories = np.concatenate(all_trajectories, axis=0)
Computational Speed¶
- Linear interpolation: Fast (~1-10 ms per trajectory)
- Spherical interpolation: Medium (~10-50 ms per trajectory)
- Bezier interpolation: Slow (~50-200 ms per trajectory)
For 1000 trajectories with 50 steps:
- Linear: ~10 seconds
- Spherical: ~1 minute
- Bezier: ~5 minutes
See Also¶
- VAE Training - Train the VAE model
- Classification - Classify trajectory timepoints
- Enrichment Analysis - Find enriched pathways
- API Reference - Detailed API documentation
References¶
- White, T. (2016). Sampling Generative Networks. arXiv preprint arXiv:1609.04468.
- Shoemake, K. (1985). Animating rotation with quaternion curves. ACM SIGGRAPH Computer Graphics, 19(3), 245-254.
- Way, G. P., & Greene, C. S. (2018). Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. Pacific Symposium on Biocomputing, 23, 80-91.
- Ding, J., et al. (2018). Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nature Communications, 9(1), 1-13.