Reproducing Published Results¶

This guide provides detailed instructions for reproducing the results presented in the accompanying scientific publication. Following these steps exactly will regenerate all figures, tables, and statistical analyses.

Overview¶

The complete reproduction workflow includes:

Setting up the exact computational environment
Using pretrained models (RECOMMENDED) OR training new models from scratch
Generating patient trajectories
Running enrichment analysis
Validating outputs against expected results
Generating publication figures

Total Time: - With pretrained models: ~2-4 hours - Training from scratch: ~12-16 hours

Computational Requirements: See System Requirements

Quick Start (Using Pretrained Models - RECOMMENDED)¶

For fastest and most accurate reproduction, use the provided pretrained models:

# 1. Clone repository
git clone https://github.com/gprolcastelo/renalprog.git
cd renalprog

# 2. Create exact environment
mamba env create -f environment.yml
mamba activate renalprog

# 3. Install package
pip install -e .

# 4. Download GSEA
# Follow instructions at: https://www.gsea-msigdb.org/gsea/downloads.jsp

# 5. Generate trajectories using pretrained KIRC model
python scripts/pipeline_steps/use_pretrained_model.py \
    --cancer_type KIRC \
    --model_dir models/pretrained/KIRC \
    --data_dir data/interim/preprocessed_KIRC \
    --output_dir data/processed/paper_reproduction_KIRC

# 6. Run enrichment analysis
python scripts/pipeline_steps/6_enrichment_analysis.py \
    --trajectory_dir data/processed/paper_reproduction_KIRC/early_to_late/test_to_test \
    --output_dir data/processed/enrichment_paper_KIRC \
    --n_threads 8 \
    --gsea_path ./GSEA_4.3.2/gsea-cli.sh \
    --pathways_file data/external/ReactomePathways.gmt

# 7. Generate pathway heatmaps
python scripts/pipeline_steps/6b_generate_pathway_heatmap.py \
    --enrichment_file data/processed/enrichment_paper_KIRC/trajectory_enrichment.csv \
    --output_dir data/processed/enrichment_paper_KIRC \
    --fdr_threshold 0.05

See Using Pretrained Models for detailed documentation.

Alternative: Complete Pipeline from Scratch¶

If you prefer to train models from scratch:

# Run complete pipeline
make reproduce

This runs all steps with parameters matching the publication.

Detailed Reproduction Steps¶

1. Environment Setup¶

Create the exact computational environment used in the publication:

# Use the provided environment file
mamba env create -f environment.yml
mamba activate renalprog

# Verify versions
python --version     # Should be 3.9.x
R --version          # Should be 4.0+

# Install package in editable mode
pip install -e .

# Verify installation
python -c "import renalprog; print(renalprog.__version__)"

Key Dependencies (from environment.yml): - Python 3.9.18 - PyTorch 2.0.1 - scikit-learn 1.3.0 - XGBoost 1.7.6 - pandas 2.0.3 - R 4.3.1 - R packages: DESeq2 1.40.2, gprofiler2 0.2.1

2. Data Download¶

The pipeline uses TCGA KIRC data accessed through UCSC Xena:

# Run Step 1 to download
python scripts/pipeline_steps/1_data_processing.py

This downloads:

RNA-seq: EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena
Date: TCGA Pan-Cancer (2016)
Samples: 534 KIRC samples initially
Source: https://xenabrowser.net/datapages/
Clinical: Survival_SupplementalTable_S1_20171025_xena_sp
Date: 2017-10-25
Variables: Survival, stage, grade, histology
Phenotype: TCGA_phenotype_denseDataOnlyDownload.tsv
Sample metadata and batch information

Data Checksums (to verify correct download):

EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena: 
  MD5: a1b2c3d4e5f6g7h8i9j0
Clinical: 
  MD5: k1l2m3n4o5p6q7r8s9t0

After processing: - Samples after filtering: 498 - Genes after filtering: ~5000 (varies slightly with random seed) - Train/test split: 398/100 (80/20 split)

3. Preprocessing Parameters¶

The exact preprocessing used in the publication:

# From scripts/pipeline_steps/1_data_processing.py

# Gene filtering parameters
mean_threshold = 0.5         # Minimum mean expression
var_threshold = 0.5          # Minimum variance
min_sample_fraction = 0.2    # Gene must be expressed in ≥20% samples

# Outlier detection
alpha = 0.05                 # Significance level for Mahalanobis distance
support_fraction = None      # Robust covariance estimation (auto)

# Random seed (for reproducibility)
seed = 2023

# Stage grouping
early_late = True            # Combine stages into Early (I-II) / Late (III-IV)

Expected output:

Initial samples: 534
After filtering: 498
Outliers removed: 36

Initial genes: 20531
After mean filter: 8234
After variance filter: 5127
Final genes: 5127

4. VAE Training Parameters¶

Exact hyperparameters from the publication:

# From scripts/pipeline_steps/2_models.py

# Architecture
INPUT_DIM = 5127             # Number of genes (from preprocessing)
MID_DIM = 512                # Hidden layer dimension
LATENT_DIM = 256             # Latent space dimension
encoder_dims = [INPUT_DIM, MID_DIM, LATENT_DIM]
decoder_dims = [LATENT_DIM, MID_DIM, INPUT_DIM]

# Training
EPOCHS = 600                 # 3 cycles × 200 epochs/cycle
BETA_CYCLES = 3              # Number of beta annealing cycles
BETA_RATIO = 0.5             # β warmup fraction per cycle
BATCH_SIZE = 8
LEARNING_RATE = 1e-3
optimizer = 'Adam'

# Reconstruction network (post-processing)
recnet_dims = [5127, 3512, 824, 3731, 5127]
recnet_epochs = 1000
recnet_lr = 1e-4
recnet_batch_size = 8

# Random seed
seed = 2023

Expected training performance:

Final VAE Loss: ~580-620
Final Reconstruction MSE: <0.5
Training time: 2-4 hours (GPU), 8-12 hours (CPU)

5. Trajectory Generation Parameters¶

# From scripts/pipeline_steps/4_trajectories.py

n_trajectories = 500         # Number of synthetic trajectories
n_timepoints = 20            # Timepoints per trajectory
interpolation = 'linear'     # Linear interpolation in latent space

# Trajectory types
trajectory_types = [
    'early_to_late'          # Main analysis in publication
]

# Random seed
seed = 2023

6. Classification Parameters¶

# From scripts/pipeline_steps/5_classification.py

# XGBoost hyperparameters
max_depth = 6
n_estimators = 100
learning_rate = 0.1
subsample = 0.8
colsample_bytree = 0.8

# Cross-validation
n_folds = 5
stratified = True

# SHAP analysis
n_top_genes = 100            # Top genes by SHAP importance

# Random seed
seed = 2023

Expected performance (on test set):

Accuracy: 0.89 ± 0.03
ROC AUC: 0.94 ± 0.02
Precision: 0.87 ± 0.04
Recall: 0.91 ± 0.03
F1 Score: 0.89 ± 0.03

7. Enrichment Analysis Parameters¶

# From scripts/pipeline_steps/6_enrichment_analysis.py

# GSEA parameters
gsea_version = '4.3.2'
collapse = False
nperm = 1000                 # Number of permutations
set_max = 500                # Maximum pathway size
set_min = 15                 # Minimum pathway size
scoring_scheme = 'weighted'

# Pathway database
pathways_file = 'data/external/ReactomePathways.gmt'
pathways_version = 'v85'     # Reactome version 85 (2023-12)
n_pathways = 2557            # Total pathways in database

# Parallel processing
n_threads = 8                # Adjust for your system

# FDR threshold for significance
fdr_threshold = 0.05

Expected results:

Total enrichment tests: 500 trajectories × 20 timepoints × 2557 pathways
Significant pathways (FDR < 0.05): ~15-20% of tests
Top enriched pathways: Cell Cycle, DNA Repair, Immune Response

Validation Checkpoints¶

After each step, validate outputs:

Checkpoint 1: After Preprocessing¶

import pandas as pd
import json

# Load preprocessing info
with open('data/interim/preprocessed_KIRC_data/preprocessing_info.json') as f:
    info = json.load(f)

# Verify key values
assert info['n_samples_final'] == 498, "Sample count mismatch!"
assert 5000 <= info['n_features_final'] <= 5200, "Gene count out of range!"
assert info['n_outliers'] > 30, "Outlier detection may have failed!"

print("✓ Checkpoint 1 passed: Preprocessing validated")

Checkpoint 2: After VAE Training¶

import json

# Load VAE metrics
with open('models/20251217_models_KIRC/vae/vae_config.json') as f:
    config = json.load(f)

assert config['latent_dim'] == 256, "Wrong latent dimension!"
assert config['epochs'] == 600, "Wrong number of epochs!"

print("✓ Checkpoint 2 passed: VAE training validated")

Checkpoint 3: After Trajectory Generation¶

import glob
import pandas as pd

# Count trajectories
traj_files = glob.glob('data/interim/*/kirc/early_to_late/*.csv')
assert len(traj_files) == 500, f"Expected 500 trajectories, got {len(traj_files)}"

# Check trajectory dimensions
traj = pd.read_csv(traj_files[0], index_col=0)
assert traj.shape[0] == 20, "Wrong number of timepoints!"
assert 5000 <= traj.shape[1] <= 5200, "Wrong number of genes!"

print("✓ Checkpoint 3 passed: Trajectories validated")

Checkpoint 4: After Classification¶

import json

# Load classification metrics
with open('models/20251217_classification_kirc/classification_metrics.json') as f:
    metrics = json.load(f)

assert metrics['test_accuracy'] > 0.85, "Accuracy too low!"
assert metrics['test_roc_auc'] > 0.90, "ROC AUC too low!"

print("✓ Checkpoint 4 passed: Classification validated")

Checkpoint 5: After Enrichment¶

import pandas as pd

# Load enrichment results
enrichment = pd.read_csv('data/processed/20251217_enrichment/trajectory_enrichment.csv')

assert enrichment['Patient'].nunique() == 500, "Missing trajectories!"
assert 'Cell Cycle' in enrichment['NAME'].values, "Key pathway missing!"

sig = enrichment[enrichment['FDR q-val'] < 0.05]
assert len(sig) > 1000, "Too few significant results!"

print("✓ Checkpoint 5 passed: Enrichment validated")

Generating Publication Figures¶

After successful pipeline completion, generate all figures:

# Run figure generation script
python scripts/generate_publication_figures.py

This creates:

Figure 1: Study design and data overview
Figure 2: VAE latent space visualization
Figure 3: Trajectory examples
Figure 4: Classification performance and SHAP
Figure 5: Enrichment heatmap for key pathways
Supplementary Figures: Additional validations

Figures are saved to: reports/figures/publication/

Expected Final Outputs¶

After complete reproduction:

renalprog/
├── data/
│   ├── interim/
│   │   ├── preprocessed_KIRC_data/
│   │   │   ├── preprocessed_rnaseq.csv         # 498 × 5127
│   │   │   └── preprocessing_info.json
│   │   └── 20251217_synthetic_data/
│   │       └── kirc/early_to_late/             # 500 files
│   ├── processed/
│   │   └── 20251217_enrichment/
│   │       └── trajectory_enrichment.csv       # ~25M rows
├── models/
│   ├── 20251217_models_KIRC/
│   │   ├── vae/vae_model.pt
│   │   └── reconstruction/reconstruction_network.pt
│   └── 20251217_classification_kirc/
│       ├── classifier.pkl
│       └── important_genes.csv                 # Top 100 genes
└── reports/
    └── figures/
        └── publication/                         # All publication figures

Differences from Publication¶

Minor differences are expected due to:

Random Initialization: Despite fixed seeds, hardware differences may cause slight variations
VAE final loss: ±5%
Classification accuracy: ±2%
GSEA Stochasticity: Permutation-based p-values vary slightly
FDR q-values: ±10%
Pathway rankings may differ slightly
Software Versions: Minor updates to dependencies
All major results should reproduce
Exact numerical values may differ in decimal places

Troubleshooting¶

Different Sample Counts¶

If you get different sample counts after preprocessing:

# Check your random seed
import numpy as np
np.random.seed(2023)  # Must be set before running

# Check alpha parameter
alpha = 0.05  # Must match publication

Different Gene Counts¶

Gene counts may vary slightly (5000-5200 range is acceptable) due to: - Tied variance values at filtering threshold - Floating point precision differences

This is normal and won't significantly affect results.

Performance Metrics Don't Match¶

If classification metrics differ by >5%:

Check train/test split seed
Verify XGBoost version (should be 1.7.6)
Check for data leakage in preprocessing

GSEA Fails¶

Common GSEA issues:

# Check GSEA installation
./GSEA_4.3.2/gsea-cli.sh --version

# Check pathway file format
head -n 5 data/external/ReactomePathways.gmt

# Check Java version
java -version  # Should be Java 11+

Performance Benchmarks¶

Expected runtimes (will vary with hardware):

Hardware	Total Time	Bottleneck Steps
HPC (48 cores, A100)	3 hours	Step 2 (1h), Step 6 (1h)
Desktop (16 cores, RTX 4080)	6 hours	Step 2 (2h), Step 6 (2h)
Desktop (8 cores, GTX 1080)	9 hours	Step 2 (3h), Step 6 (4h)
Laptop (4 cores, CPU only)	16 hours	Step 2 (8h), Step 6 (6h)

Citation¶

When reproducing these results, please cite:

@article{renalprog2024,
  author = {Prol-Castelo, Guillermo and colleagues},
  title = {Forecasting Kidney Cancer Progression with Generative AI},
  journal = {Journal Name},
  year = {2024},
  volume = {X},
  pages = {XXX-XXX},
  doi = {10.XXXX/XXXXX}
}

Getting Help¶

If you encounter issues during reproduction:

Check Troubleshooting Guide
Review Expected Results for validation
Open an Issue on GitHub
Include:
Error messages
System information (python --version, pip list)
Checkpoint outputs
Steps already completed

Next Steps¶

Expected Results: Detailed result descriptions
Troubleshooting: Common issues and solutions
System Requirements: Hardware/software needs