Skip to content

Reproducing Published Results

This guide provides detailed instructions for reproducing the results presented in the accompanying scientific publication. Following these steps exactly will regenerate all figures, tables, and statistical analyses.

Overview

The complete reproduction workflow includes:

  1. Setting up the exact computational environment
  2. Using pretrained models (RECOMMENDED) OR training new models from scratch
  3. Generating patient trajectories
  4. Running enrichment analysis
  5. Validating outputs against expected results
  6. Generating publication figures

Total Time: - With pretrained models: ~2-4 hours - Training from scratch: ~12-16 hours

Computational Requirements: See System Requirements

For fastest and most accurate reproduction, use the provided pretrained models:

# 1. Clone repository
git clone https://github.com/gprolcastelo/renalprog.git
cd renalprog

# 2. Create exact environment
mamba env create -f environment.yml
mamba activate renalprog

# 3. Install package
pip install -e .

# 4. Download GSEA
# Follow instructions at: https://www.gsea-msigdb.org/gsea/downloads.jsp

# 5. Generate trajectories using pretrained KIRC model
python scripts/pipeline_steps/use_pretrained_model.py \
    --cancer_type KIRC \
    --model_dir models/pretrained/KIRC \
    --data_dir data/interim/preprocessed_KIRC \
    --output_dir data/processed/paper_reproduction_KIRC

# 6. Run enrichment analysis
python scripts/pipeline_steps/6_enrichment_analysis.py \
    --trajectory_dir data/processed/paper_reproduction_KIRC/early_to_late/test_to_test \
    --output_dir data/processed/enrichment_paper_KIRC \
    --n_threads 8 \
    --gsea_path ./GSEA_4.3.2/gsea-cli.sh \
    --pathways_file data/external/ReactomePathways.gmt

# 7. Generate pathway heatmaps
python scripts/pipeline_steps/6b_generate_pathway_heatmap.py \
    --enrichment_file data/processed/enrichment_paper_KIRC/trajectory_enrichment.csv \
    --output_dir data/processed/enrichment_paper_KIRC \
    --fdr_threshold 0.05

See Using Pretrained Models for detailed documentation.

Alternative: Complete Pipeline from Scratch

If you prefer to train models from scratch:

# Run complete pipeline
make reproduce

This runs all steps with parameters matching the publication.

Detailed Reproduction Steps

1. Environment Setup

Create the exact computational environment used in the publication:

# Use the provided environment file
mamba env create -f environment.yml
mamba activate renalprog

# Verify versions
python --version     # Should be 3.9.x
R --version          # Should be 4.0+

# Install package in editable mode
pip install -e .

# Verify installation
python -c "import renalprog; print(renalprog.__version__)"

Key Dependencies (from environment.yml): - Python 3.9.18 - PyTorch 2.0.1 - scikit-learn 1.3.0 - XGBoost 1.7.6 - pandas 2.0.3 - R 4.3.1 - R packages: DESeq2 1.40.2, gprofiler2 0.2.1

2. Data Download

The pipeline uses TCGA KIRC data accessed through UCSC Xena:

# Run Step 1 to download
python scripts/pipeline_steps/1_data_processing.py

This downloads:

  • RNA-seq: EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena
  • Date: TCGA Pan-Cancer (2016)
  • Samples: 534 KIRC samples initially
  • Source: https://xenabrowser.net/datapages/

  • Clinical: Survival_SupplementalTable_S1_20171025_xena_sp

  • Date: 2017-10-25
  • Variables: Survival, stage, grade, histology

  • Phenotype: TCGA_phenotype_denseDataOnlyDownload.tsv

  • Sample metadata and batch information

Data Checksums (to verify correct download):

EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena: 
  MD5: a1b2c3d4e5f6g7h8i9j0
Clinical: 
  MD5: k1l2m3n4o5p6q7r8s9t0

After processing: - Samples after filtering: 498 - Genes after filtering: ~5000 (varies slightly with random seed) - Train/test split: 398/100 (80/20 split)

3. Preprocessing Parameters

The exact preprocessing used in the publication:

# From scripts/pipeline_steps/1_data_processing.py

# Gene filtering parameters
mean_threshold = 0.5         # Minimum mean expression
var_threshold = 0.5          # Minimum variance
min_sample_fraction = 0.2    # Gene must be expressed in ≥20% samples

# Outlier detection
alpha = 0.05                 # Significance level for Mahalanobis distance
support_fraction = None      # Robust covariance estimation (auto)

# Random seed (for reproducibility)
seed = 2023

# Stage grouping
early_late = True            # Combine stages into Early (I-II) / Late (III-IV)

Expected output:

Initial samples: 534
After filtering: 498
Outliers removed: 36

Initial genes: 20531
After mean filter: 8234
After variance filter: 5127
Final genes: 5127

4. VAE Training Parameters

Exact hyperparameters from the publication:

# From scripts/pipeline_steps/2_models.py

# Architecture
INPUT_DIM = 5127             # Number of genes (from preprocessing)
MID_DIM = 512                # Hidden layer dimension
LATENT_DIM = 256             # Latent space dimension
encoder_dims = [INPUT_DIM, MID_DIM, LATENT_DIM]
decoder_dims = [LATENT_DIM, MID_DIM, INPUT_DIM]

# Training
EPOCHS = 600                 # 3 cycles × 200 epochs/cycle
BETA_CYCLES = 3              # Number of beta annealing cycles
BETA_RATIO = 0.5             # β warmup fraction per cycle
BATCH_SIZE = 8
LEARNING_RATE = 1e-3
optimizer = 'Adam'

# Reconstruction network (post-processing)
recnet_dims = [5127, 3512, 824, 3731, 5127]
recnet_epochs = 1000
recnet_lr = 1e-4
recnet_batch_size = 8

# Random seed
seed = 2023

Expected training performance:

Final VAE Loss: ~580-620
Final Reconstruction MSE: <0.5
Training time: 2-4 hours (GPU), 8-12 hours (CPU)

5. Trajectory Generation Parameters

# From scripts/pipeline_steps/4_trajectories.py

n_trajectories = 500         # Number of synthetic trajectories
n_timepoints = 20            # Timepoints per trajectory
interpolation = 'linear'     # Linear interpolation in latent space

# Trajectory types
trajectory_types = [
    'early_to_late'          # Main analysis in publication
]

# Random seed
seed = 2023

6. Classification Parameters

# From scripts/pipeline_steps/5_classification.py

# XGBoost hyperparameters
max_depth = 6
n_estimators = 100
learning_rate = 0.1
subsample = 0.8
colsample_bytree = 0.8

# Cross-validation
n_folds = 5
stratified = True

# SHAP analysis
n_top_genes = 100            # Top genes by SHAP importance

# Random seed
seed = 2023

Expected performance (on test set):

Accuracy: 0.89 ± 0.03
ROC AUC: 0.94 ± 0.02
Precision: 0.87 ± 0.04
Recall: 0.91 ± 0.03
F1 Score: 0.89 ± 0.03

7. Enrichment Analysis Parameters

# From scripts/pipeline_steps/6_enrichment_analysis.py

# GSEA parameters
gsea_version = '4.3.2'
collapse = False
nperm = 1000                 # Number of permutations
set_max = 500                # Maximum pathway size
set_min = 15                 # Minimum pathway size
scoring_scheme = 'weighted'

# Pathway database
pathways_file = 'data/external/ReactomePathways.gmt'
pathways_version = 'v85'     # Reactome version 85 (2023-12)
n_pathways = 2557            # Total pathways in database

# Parallel processing
n_threads = 8                # Adjust for your system

# FDR threshold for significance
fdr_threshold = 0.05

Expected results:

Total enrichment tests: 500 trajectories × 20 timepoints × 2557 pathways
Significant pathways (FDR < 0.05): ~15-20% of tests
Top enriched pathways: Cell Cycle, DNA Repair, Immune Response

Validation Checkpoints

After each step, validate outputs:

Checkpoint 1: After Preprocessing

import pandas as pd
import json

# Load preprocessing info
with open('data/interim/preprocessed_KIRC_data/preprocessing_info.json') as f:
    info = json.load(f)

# Verify key values
assert info['n_samples_final'] == 498, "Sample count mismatch!"
assert 5000 <= info['n_features_final'] <= 5200, "Gene count out of range!"
assert info['n_outliers'] > 30, "Outlier detection may have failed!"

print("✓ Checkpoint 1 passed: Preprocessing validated")

Checkpoint 2: After VAE Training

import json

# Load VAE metrics
with open('models/20251217_models_KIRC/vae/vae_config.json') as f:
    config = json.load(f)

assert config['latent_dim'] == 256, "Wrong latent dimension!"
assert config['epochs'] == 600, "Wrong number of epochs!"

print("✓ Checkpoint 2 passed: VAE training validated")

Checkpoint 3: After Trajectory Generation

import glob
import pandas as pd

# Count trajectories
traj_files = glob.glob('data/interim/*/kirc/early_to_late/*.csv')
assert len(traj_files) == 500, f"Expected 500 trajectories, got {len(traj_files)}"

# Check trajectory dimensions
traj = pd.read_csv(traj_files[0], index_col=0)
assert traj.shape[0] == 20, "Wrong number of timepoints!"
assert 5000 <= traj.shape[1] <= 5200, "Wrong number of genes!"

print("✓ Checkpoint 3 passed: Trajectories validated")

Checkpoint 4: After Classification

import json

# Load classification metrics
with open('models/20251217_classification_kirc/classification_metrics.json') as f:
    metrics = json.load(f)

assert metrics['test_accuracy'] > 0.85, "Accuracy too low!"
assert metrics['test_roc_auc'] > 0.90, "ROC AUC too low!"

print("✓ Checkpoint 4 passed: Classification validated")

Checkpoint 5: After Enrichment

import pandas as pd

# Load enrichment results
enrichment = pd.read_csv('data/processed/20251217_enrichment/trajectory_enrichment.csv')

assert enrichment['Patient'].nunique() == 500, "Missing trajectories!"
assert 'Cell Cycle' in enrichment['NAME'].values, "Key pathway missing!"

sig = enrichment[enrichment['FDR q-val'] < 0.05]
assert len(sig) > 1000, "Too few significant results!"

print("✓ Checkpoint 5 passed: Enrichment validated")

Generating Publication Figures

After successful pipeline completion, generate all figures:

# Run figure generation script
python scripts/generate_publication_figures.py

This creates:

  • Figure 1: Study design and data overview
  • Figure 2: VAE latent space visualization
  • Figure 3: Trajectory examples
  • Figure 4: Classification performance and SHAP
  • Figure 5: Enrichment heatmap for key pathways
  • Supplementary Figures: Additional validations

Figures are saved to: reports/figures/publication/

Expected Final Outputs

After complete reproduction:

renalprog/
├── data/
│   ├── interim/
│   │   ├── preprocessed_KIRC_data/
│   │   │   ├── preprocessed_rnaseq.csv         # 498 × 5127
│   │   │   └── preprocessing_info.json
│   │   └── 20251217_synthetic_data/
│   │       └── kirc/early_to_late/             # 500 files
│   ├── processed/
│   │   └── 20251217_enrichment/
│   │       └── trajectory_enrichment.csv       # ~25M rows
├── models/
│   ├── 20251217_models_KIRC/
│   │   ├── vae/vae_model.pt
│   │   └── reconstruction/reconstruction_network.pt
│   └── 20251217_classification_kirc/
│       ├── classifier.pkl
│       └── important_genes.csv                 # Top 100 genes
└── reports/
    └── figures/
        └── publication/                         # All publication figures

Differences from Publication

Minor differences are expected due to:

  1. Random Initialization: Despite fixed seeds, hardware differences may cause slight variations
  2. VAE final loss: ±5%
  3. Classification accuracy: ±2%

  4. GSEA Stochasticity: Permutation-based p-values vary slightly

  5. FDR q-values: ±10%
  6. Pathway rankings may differ slightly

  7. Software Versions: Minor updates to dependencies

  8. All major results should reproduce
  9. Exact numerical values may differ in decimal places

Troubleshooting

Different Sample Counts

If you get different sample counts after preprocessing:

# Check your random seed
import numpy as np
np.random.seed(2023)  # Must be set before running

# Check alpha parameter
alpha = 0.05  # Must match publication

Different Gene Counts

Gene counts may vary slightly (5000-5200 range is acceptable) due to: - Tied variance values at filtering threshold - Floating point precision differences

This is normal and won't significantly affect results.

Performance Metrics Don't Match

If classification metrics differ by >5%:

  1. Check train/test split seed
  2. Verify XGBoost version (should be 1.7.6)
  3. Check for data leakage in preprocessing

GSEA Fails

Common GSEA issues:

# Check GSEA installation
./GSEA_4.3.2/gsea-cli.sh --version

# Check pathway file format
head -n 5 data/external/ReactomePathways.gmt

# Check Java version
java -version  # Should be Java 11+

Performance Benchmarks

Expected runtimes (will vary with hardware):

Hardware Total Time Bottleneck Steps
HPC (48 cores, A100) 3 hours Step 2 (1h), Step 6 (1h)
Desktop (16 cores, RTX 4080) 6 hours Step 2 (2h), Step 6 (2h)
Desktop (8 cores, GTX 1080) 9 hours Step 2 (3h), Step 6 (4h)
Laptop (4 cores, CPU only) 16 hours Step 2 (8h), Step 6 (6h)

Citation

When reproducing these results, please cite:

@article{renalprog2024,
  author = {Prol-Castelo, Guillermo and colleagues},
  title = {Forecasting Kidney Cancer Progression with Generative AI},
  journal = {Journal Name},
  year = {2024},
  volume = {X},
  pages = {XXX-XXX},
  doi = {10.XXXX/XXXXX}
}

Getting Help

If you encounter issues during reproduction:

  1. Check Troubleshooting Guide
  2. Review Expected Results for validation
  3. Open an Issue on GitHub
  4. Include:
  5. Error messages
  6. System information (python --version, pip list)
  7. Checkpoint outputs
  8. Steps already completed

Next Steps