Complete Pipeline Tutorial¶
This tutorial walks through the complete renalprog analysis pipeline from raw TCGA data to pathway enrichment results. This represents the full workflow used in the publication.
Overview¶
The pipeline consists of six main steps:
1. Data Processing: Download and preprocess TCGA data
2. VAE Training: Train deep generative models
3. Reconstruction Validation: Assess model quality
4. Trajectory Generation: Create synthetic progression paths
5. Classification: Stage prediction and biomarker discovery
6. Enrichment Analysis: Pathway-level interpretation
Total Time: ~8-12 hours (varies with hardware)
Requirements:
- 16+ GB RAM
- 50+ GB free disk space
- GPU recommended for steps 2-4
- R 4.0+ for step 6
Prerequisites¶
Ensure you have completed the installation before running any steps.
Pipeline Execution¶
Option 1: Run All Steps Sequentially¶
# Navigate to scripts directory
cd scripts/pipeline_steps
# Step 1: Data Processing (30 minutes)
python 1_data_processing.py
# Step 2: VAE Training (2-4 hours with GPU)
python 2_models.py
# Step 3: Reconstruction Check (10 minutes)
python 3_check_reconstruction.py
# Step 4: Trajectory Generation (30 minutes)
python 4_trajectories.py
# Step 5: Classification (1 hour)
python 5_classification.py
# Step 6: Enrichment Analysis (2-6 hours with 8 cores)
python 6_enrichment_analysis.py
Option 2: Run with Makefile¶
# Run entire pipeline
make pipeline
# Or run individual steps
make data
make models
make trajectories
make classification
make enrichment
Option 3: Use Job Scripts (HPC)¶
If you're on an HPC cluster with SLURM:
# Submit all jobs
sbatch jobs/job_data
sbatch jobs/job_models
sbatch jobs/job_trajectories
sbatch jobs/job_classification
sbatch jobs/job_enrichment
Detailed Step-by-Step Guide¶
Step 1: Data Processing¶
Script: scripts/pipeline_steps/1_data_processing.py
What it does:
- Downloads TCGA KIRC RNA-seq and clinical data
- Filters low-expression genes
- Removes outliers using Mahalanobis distance (see the sketch below)
- Creates train/test splits
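The Mahalanobis filter can be illustrated in a few lines. This is a hedged sketch, not the script's exact procedure: the PCA projection, the component count, and the 97.5% chi-square cutoff are assumptions, and random data stands in for the real matrix.
import numpy as np
from scipy.stats import chi2
from sklearn.decomposition import PCA

X = np.random.randn(498, 5000)  # stand-in for the samples × genes matrix

# Project onto a few principal components so the covariance is well-conditioned
X_pc = PCA(n_components=10).fit_transform(X)

# Squared Mahalanobis distance of each sample from the centroid
mu = X_pc.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X_pc, rowvar=False))
diff = X_pc - mu
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# Under normality, d2 follows a chi-square with df = n_components
keep = d2 < chi2.ppf(0.975, df=X_pc.shape[1])
print(f"Keeping {keep.sum()} of {len(keep)} samples")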
Expected outputs:
data/interim/preprocessed_KIRC_data/
├── preprocessed_rnaseq.csv # Filtered gene expression (498 samples × ~5000 genes)
├── preprocessing_info.json # Filter statistics
└── stages.csv # Cancer stage labels
Validation:
import pandas as pd
rnaseq = pd.read_csv('data/interim/preprocessed_KIRC_data/preprocessed_rnaseq.csv', index_col=0)
print(f"Samples: {rnaseq.shape[0]}, Genes: {rnaseq.shape[1]}")
# Expected: Samples: 498, Genes: ~5000
# Check for missing values
assert rnaseq.isnull().sum().sum() == 0, "Missing values detected!"
print("✓ Data quality check passed")
Time: ~30 minutes
See: Data Processing Tutorial
Step 2: VAE Training¶
Script: scripts/pipeline_steps/2_models.py
What it does:
- Trains a Variational Autoencoder on gene expression
- Learns a low-dimensional latent representation
- Trains a reconstruction network for improved decoding
- Saves models and training histories
Configuration (in script):
vae_config.INPUT_DIM = X_train.shape[1] # Number of genes
vae_config.MID_DIM = 512 # Hidden layer size
vae_config.LATENT_DIM = 256 # Latent space dimensionality
vae_config.EPOCHS = 600 # Total epochs (3 cycles × 200)
vae_config.BATCH_SIZE = 8
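For orientation, the configuration above implies an encoder/decoder of roughly the following shape. This is an illustrative PyTorch sketch, not the exact renalprog architecture (layer counts and activations are assumptions); it returns the reconstruction first, matching the vae(X_tensor)[0] indexing used in the validation below.
import torch
import torch.nn as nn

class SketchVAE(nn.Module):
    """Illustrative VAE with the dimensions from the config above."""
    def __init__(self, input_dim, mid_dim=512, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, mid_dim), nn.ReLU())
        self.fc_mu = nn.Linear(mid_dim, latent_dim)      # mean of q(z|x)
        self.fc_logvar = nn.Linear(mid_dim, latent_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, mid_dim), nn.ReLU(), nn.Linear(mid_dim, input_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar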
Expected outputs:
models/20251217_models_KIRC/
├── vae/
│ ├── vae_model.pt # Trained VAE
│ ├── vae_config.json # Model configuration
│ └── vae_training_history.png # Loss curves
├── reconstruction/
│ ├── reconstruction_network.pt # Post-processing network
│ └── reconstruction_network_history.png
└── training_data/
├── X_train.csv
└── X_test.csv
Validation:
import pandas as pd
import torch
from renalprog.modeling.models import VAE
# Load the trained model
vae = VAE.load('models/20251217_models_KIRC/vae/vae_model.pt')
# Check reconstruction on the held-out test set
X_test = pd.read_csv('models/20251217_models_KIRC/training_data/X_test.csv', index_col=0)
X_tensor = torch.FloatTensor(X_test.values)
with torch.no_grad():
    X_recon = vae(X_tensor)[0]
mse = ((X_tensor - X_recon) ** 2).mean()
print(f"Reconstruction MSE: {mse:.4f}")
# Expected: MSE < 1.0 for a well-trained model
Time: 2-4 hours (GPU), 8-12 hours (CPU)
See: VAE Training Tutorial
Step 3: Reconstruction Validation¶
Script: scripts/pipeline_steps/3_check_reconstruction.py
What it does:
- Visualizes the latent space with UMAP/t-SNE (see the sketch below)
- Compares original vs. reconstructed gene expression
- Generates quality control plots
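A minimal sketch of the UMAP visualization, assuming you have already encoded samples into a latent matrix (umap-learn is required; random data stands in for the real latent vectors and stage labels):
import numpy as np
import umap
import matplotlib.pyplot as plt

Z = np.random.randn(100, 256)          # stand-in for encoded latent vectors
stages = np.random.randint(1, 5, 100)  # stand-in for cancer stage labels

embedding = umap.UMAP(random_state=42).fit_transform(Z)
plt.scatter(embedding[:, 0], embedding[:, 1], c=stages, cmap='viridis', s=10)
plt.colorbar(label='Stage')
plt.savefig('latent_space_umap.png', dpi=150)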
Expected outputs:
reports/figures/reconstruction/
├── latent_space_umap.png # UMAP colored by stage
├── latent_space_tsne.png # t-SNE colored by stage
├── reconstruction_scatter.png # Original vs. reconstructed
└── gene_correlation.png # Per-gene reconstruction quality
Validation:
# Check that plots exist
import os
plot_dir = 'reports/figures/reconstruction/'
required_plots = [
    'latent_space_umap.png',
    'reconstruction_scatter.png',
]
for plot in required_plots:
    assert os.path.exists(os.path.join(plot_dir, plot)), f"Missing: {plot}"
print("✓ All validation plots generated")
Time: ~10 minutes
See: Reconstruction Tutorial
Step 4: Trajectory Generation¶
Script: scripts/pipeline_steps/4_trajectories.py
What it does:
- Generates synthetic patient trajectories
- Interpolates between early- and late-stage samples (see the sketch after the configuration below)
- Decodes trajectories back to gene expression space
- Creates multiple trajectory types
Configuration:
n_trajectories = 500 # Number of trajectories to generate
n_timepoints = 20 # Timepoints per trajectory
trajectory_types = [
    'early_to_late',      # Early stage → Late stage
    'stage1_to_stage4',   # Stage I → Stage IV
    'matched_pairs',      # Patient-specific progressions
]
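Conceptually, each trajectory interpolates between the latent codes of an early-stage and a late-stage patient, then decodes every intermediate point back to expression space. A minimal sketch assuming straight-line interpolation (the script's actual scheme may differ):
import numpy as np

def interpolate_latent(z_start, z_end, n_timepoints=20):
    """Linearly interpolate between two latent vectors."""
    alphas = np.linspace(0.0, 1.0, n_timepoints)
    return np.stack([(1 - a) * z_start + a * z_end for a in alphas])

z_early, z_late = np.random.randn(256), np.random.randn(256)  # stand-in latent codes
path = interpolate_latent(z_early, z_late)  # shape: (20, 256)
# Each row would then be passed through the VAE decoder to obtain gene expression.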
Expected outputs:
data/interim/20251217_synthetic_data/kirc/
├── early_to_late/
│ ├── TCGA-XXX-YYY_to_TCGA-AAA-BBB.csv
│ ├── TCGA-XXX-ZZZ_to_TCGA-CCC-DDD.csv
│ └── ... # 500 trajectory files
├── trajectory_metadata.csv # Start/end points, stages
└── generation_params.json # Generation parameters
Validation:
import pandas as pd
import glob
# Count trajectories
trajectory_files = glob.glob('data/interim/*/kirc/early_to_late/*.csv')
print(f"Generated {len(trajectory_files)} trajectories")
# Expected: 500 trajectories
# Check trajectory format
traj = pd.read_csv(trajectory_files[0], index_col=0)
print(f"Trajectory shape: {traj.shape}")
# Expected: (20 timepoints, ~5000 genes)
assert traj.shape[0] == 20, "Wrong number of timepoints"
assert traj.isnull().sum().sum() == 0, "Missing values in trajectory"
print("✓ Trajectory validation passed")
Time: ~30 minutes
See: Trajectory Tutorial
Step 5: Classification¶
Script: scripts/pipeline_steps/5_classification.py
What it does:
- Trains an XGBoost classifier for stage prediction
- Calculates SHAP values for interpretability (see the sketch below)
- Identifies important gene signatures
- Evaluates performance with cross-validation
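The classifier-plus-SHAP pattern this step follows looks roughly like the sketch below. Synthetic data stands in for the real expression matrix and stage labels, and the hyperparameters are illustrative:
import numpy as np
import xgboost as xgb
import shap

X = np.random.randn(200, 50)      # stand-in for expression features
y = np.random.randint(0, 2, 200)  # stand-in for early/late stage labels

model = xgb.XGBClassifier(n_estimators=100, eval_metric='logloss').fit(X, y)

# TreeExplainer yields one attribution per sample and feature
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
print(shap_values.shape)  # (200, 50)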
Expected outputs:
models/20251217_classification_kirc/
├── classifier.pkl # Trained XGBoost model
├── classification_metrics.json # Accuracy, ROC AUC, etc.
├── shap_values.npy # SHAP importance values
├── important_genes.csv # Top biomarker genes
└── figures/
├── confusion_matrix.png
├── roc_curve.png
├── shap_summary.png
└── shap_waterfall.png
Validation:
import json
# Check performance metrics
with open('models/20251217_classification_kirc/classification_metrics.json') as f:
    metrics = json.load(f)
print(f"Test Accuracy: {metrics['test_accuracy']:.3f}")
print(f"ROC AUC: {metrics['test_roc_auc']:.3f}")
# Expected performance for KIRC dataset
assert metrics['test_accuracy'] > 0.85, "Low accuracy!"
assert metrics['test_roc_auc'] > 0.90, "Low ROC AUC!"
print("✓ Classification performance acceptable")
Time: ~1 hour
See: Classification Tutorial
Step 6: Enrichment Analysis¶
Script: scripts/pipeline_steps/6_enrichment_analysis.py
What it does:
- Calculates differential expression (DESeq-style) for each timepoint
- Runs GSEA on ranked gene lists (see the .rnk sketch after the configuration below)
- Identifies enriched pathways along trajectories
- Combines results into a single dataset
Prerequisites:
- GSEA CLI tool installed (installation guide)
- ReactomePathways.gmt in data/external/
Configuration:
n_threads = 8 # Parallel processing threads
gsea_path = './GSEA_4.3.2/gsea-cli.sh'
pathways_file = 'data/external/ReactomePathways.gmt'
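Each timepoint is reduced to GSEA's two-column .rnk format (gene identifier, tab, ranking score). A minimal sketch, assuming the score is the log2 fold change of each timepoint against the trajectory's first timepoint (the script's exact statistic may differ):
import numpy as np
import pandas as pd

# Stand-in trajectory: 20 timepoints × 100 genes of positive expression values
genes = [f'GENE{i}' for i in range(100)]
traj = pd.DataFrame(np.abs(np.random.randn(20, 100)) + 1.0, columns=genes)

baseline = traj.iloc[0]
for tp in range(1, len(traj)):
    log2fc = np.log2(traj.iloc[tp] / baseline)    # per-gene fold change vs. tp0
    ranked = log2fc.sort_values(ascending=False)  # GSEA expects a ranked list
    ranked.to_csv(f'patient_tp{tp}_foldchange.rnk', sep='\t', header=False)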
Expected outputs:
data/processed/20251217_enrichment/
├── deseq/
│ └── early_to_late/
│ ├── TCGA-XXX_to_YYY/
│ │ ├── patient_tp0_foldchange.rnk
│ │ ├── patient_tp1_foldchange.rnk
│ │ └── gsea_tp0/
│ │ ├── gsea_report_for_na_pos_*.tsv
│ │ └── gsea_report_for_na_neg_*.tsv
│ └── ...
└── trajectory_enrichment.csv # Final combined results
Final result format:
Patient,Idx,Transition,NAME,ES,NES,FDR q-val
TCGA-3Z-A93Z-01_to_TCGA-A3-A8OW-01,0,early_to_late,Cell Cycle,0.65,2.13,0.001
TCGA-3Z-A93Z-01_to_TCGA-A3-A8OW-01,0,early_to_late,DNA Repair,0.52,1.87,0.012
...
Validation:
import pandas as pd
# Load enrichment results
enrichment = pd.read_csv('data/processed/20251217_enrichment/trajectory_enrichment.csv')
print(f"Total enrichment results: {len(enrichment)}")
print(f"Unique patients: {enrichment['Patient'].nunique()}")
print(f"Unique pathways: {enrichment['NAME'].nunique()}")
# Check for significant pathways
sig_pathways = enrichment[enrichment['FDR q-val'] < 0.05]
print(f"Significant pathways (FDR < 0.05): {len(sig_pathways)}")
assert len(enrichment) > 0, "No enrichment results!"
assert enrichment['Patient'].nunique() == 500, "Missing trajectories!"
print("✓ Enrichment analysis complete")
Time: 2-6 hours (depends on CPU cores)
See: Enrichment Tutorial
Checking Pipeline Outputs¶
After running the complete pipeline, verify all outputs by checking the key files manually:
import os
import pandas as pd
# Check each step's outputs
checks = {
    'Step 1': 'data/interim/preprocessed_KIRC_data/preprocessed_rnaseq.csv',
    'Step 2': 'models/20251217_models_KIRC/vae/vae_model.pt',
    'Step 3': 'reports/figures/reconstruction/latent_space_umap.png',
    'Step 4': 'data/interim/20251217_synthetic_data/kirc/trajectory_metadata.csv',
    'Step 5': 'models/20251217_classification_kirc/classifier.pkl',
    'Step 6': 'data/processed/20251217_enrichment/trajectory_enrichment.csv',
}
for step, filepath in checks.items():
    exists = os.path.exists(filepath)
    status = "✓" if exists else "✗"
    print(f"{status} {step}: {filepath}")
Expected Disk Usage¶
After completing the pipeline:
| Directory | Size | Contents |
|---|---|---|
| data/raw/ | ~5 GB | TCGA downloads |
| data/interim/ | ~15 GB | trajectories + intermediate files |
| data/processed/ | ~10 GB | enrichment results |
| models/ | ~1 GB | trained models |
| reports/ | ~100 MB | figures |
| Total | ~31 GB | |
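To compare your actual usage against these estimates, a quick sketch:
import os

def dir_size_gb(path):
    """Total size of all files under path, in GB."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.exists(fp):
                total += os.path.getsize(fp)
    return total / 1e9

for d in ['data/raw', 'data/interim', 'data/processed', 'models', 'reports']:
    if os.path.isdir(d):
        print(f'{d}: {dir_size_gb(d):.1f} GB')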
Performance Benchmarks¶
Typical runtimes on different systems:
| System | Step 2 | Step 6 | Total | Config |
|---|---|---|---|---|
| HPC Cluster | 1 hr | 1 hr | 3 hrs | 48 cores, A100 GPU |
| Desktop (2023) | 2 hrs | 2 hrs | 6 hrs | 16 cores, RTX 4080 |
| Desktop (2020) | 3 hrs | 4 hrs | 9 hrs | 8 cores, GTX 1080 |
| Laptop (2018) | 8 hrs | 6 hrs | 16 hrs | 4 cores, CPU only |
Troubleshooting¶
Pipeline Fails at Step X¶
Resume by re-running the failing step's script from scripts/pipeline_steps (Option 1) or the corresponding Makefile target, e.g. make trajectories if step 4 failed; outputs already written by earlier steps do not need to be regenerated.
Out of Disk Space¶
Clean up intermediate files:
# Remove GSEA temporary files
rm -rf data/processed/*/deseq/*/gsea_*
# Remove old dated directories
rm -rf data/interim/202412* # Keep only latest
Out of Memory¶
Reduce batch sizes and threads:
# In 2_models.py
vae_config.BATCH_SIZE = 4 # Reduce from 8
# In 6_enrichment_analysis.py
n_threads = 2 # Reduce from 8
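If you are unsure whether the pressure is on GPU or host memory, a quick check (note that PyTorch reports only its own allocations):
import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    allocated = torch.cuda.memory_allocated() / 1e9
    total = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"{name}: {allocated:.1f} / {total:.1f} GB allocated by PyTorch")
else:
    print("No CUDA GPU detected; training will run on CPU")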
Next Steps¶
After completing the pipeline:
- Analyze Results: Explore the enrichment results
- Create Visualizations: Generate publication figures
- Interpret Pathways: Map significant pathways to biology
- Validate Findings: Test predictions experimentally
See:
- Visualization Tutorial
- Result Interpretation Guide
- API Reference for custom analyses
Citation¶
If you use this pipeline in your research, please cite: