renalprog¶

A Python package for simulating kidney cancer progression with synthetic data generation and machine learning

Logo: Kidneys icons created by Smashicons - Flaticon

Warning

This site is under construction. Some pages may be missing.

Overview¶

renalprog is a bioinformatics pipeline for analyzing the progression of kidney renal clear cell carcinoma (KIRC) using deep learning and pathway enrichment analysis. The package combines Variational Autoencoders (VAEs) with differential expression analysis and gene set enrichment to model and predict cancer progression trajectories.

Scientific Context¶

Cancer progression is a complex, dynamic process involving multiple molecular alterations over time. Static analyses, which examine samples at a single point in time, do not capture the temporal dynamics of tumor evolution. renalprog addresses this challenge by:

Learning latent representations of gene expression data using deep generative models (VAEs)
Generating synthetic trajectories between cancer stages in latent space
Identifying enriched biological pathways along progression trajectories
Classifying cancer stages using interpretable machine learning

This approach enables researchers to:

Identify key biological pathways driving cancer progression
Predict patient outcomes based on molecular profiles
Generate testable hypotheses about therapeutic targets
Understand the temporal dynamics of tumor evolution

Key Features¶

🔬 Data Processing¶

Automated filtering of low-expression genes
Robust outlier detection using Mahalanobis distance
Normalization and batch effect correction
Integration with TCGA and other genomics datasets
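
As an illustration of the low-expression filtering step, the following is a minimal NumPy sketch. It is not the package's implementation: the function name `filter_low_expression` and both thresholds (`min_mean`, `min_frac_expressed`) are assumptions chosen for the example.

```python
import numpy as np

def filter_low_expression(counts, min_mean=1.0, min_frac_expressed=0.2):
    """Return a boolean mask of genes (columns) to keep.

    counts: array of shape (n_samples, n_genes) with non-negative values.
    A gene is kept when its mean expression is at least `min_mean` and it is
    detected (value > 0) in at least `min_frac_expressed` of the samples.
    Both thresholds are illustrative assumptions.
    """
    counts = np.asarray(counts, dtype=float)
    mean_ok = counts.mean(axis=0) >= min_mean
    frac_ok = (counts > 0).mean(axis=0) >= min_frac_expressed
    return mean_ok & frac_ok

# Toy matrix: 4 samples x 3 genes
counts = np.array([
    [10.0, 0.0, 0.0],
    [12.0, 0.0, 5.0],
    [8.0, 0.0, 0.0],
    [11.0, 1.0, 0.0],
])
keep = filter_low_expression(counts)
filtered = counts[:, keep]  # the second gene is dropped (mean 0.25)
```

Suitable thresholds depend on the normalization applied to the expression values, so they should be tuned per dataset.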

🧠 Deep Learning Models¶

Variational Autoencoder (VAE) for unsupervised representation learning
Conditional VAE (CVAE) for stage-specific modeling
Support for custom architectures and hyperparameters
GPU acceleration for large-scale datasets
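
For readers new to VAEs, the sketch below shows the two ingredients of the standard training objective: the reparameterization trick and the KL-regularized reconstruction loss. It uses NumPy for the forward math only and assumes a diagonal Gaussian posterior, a squared-error reconstruction term, and a weight `beta`. The function names are hypothetical and do not reflect the renalprog API, which trains its models with PyTorch.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_divergence(mu, logvar):
    """KL(q(z|x) || N(0, I)) per sample for a diagonal Gaussian posterior."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1)

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Mean over samples of squared-error reconstruction plus beta * KL."""
    recon = np.sum((x - x_recon) ** 2, axis=1)
    return np.mean(recon + beta * kl_divergence(mu, logvar))

rng = np.random.default_rng(0)
mu = np.zeros((4, 2))       # posterior means for 4 samples, 2 latent dims
logvar = np.zeros((4, 2))   # log-variances (sigma = 1)
z = reparameterize(mu, logvar, rng)
x = np.ones((4, 3))
loss = vae_loss(x, x, mu, logvar)  # perfect reconstruction, posterior equals prior
```

A conditional VAE follows the same objective but additionally feeds a condition (here, the cancer stage) to both the encoder and the decoder.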

🔄 Trajectory Generation¶

Generate synthetic patient trajectories between cancer stages
Interpolation in latent space with biological constraints
Multiple trajectory types (early-to-late, stage-specific, custom)
Quality control and validation metrics
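
The simplest form of latent-space interpolation is a straight line between two latent vectors. The sketch below assumes linear interpolation and a hypothetical helper name, `interpolate_latent`. The package may use a different scheme or apply additional constraints.

```python
import numpy as np

def interpolate_latent(z_start, z_end, n_steps=5):
    """Return n_steps points on the straight line from z_start to z_end.

    The result has shape (n_steps, latent_dim) and includes both endpoints.
    """
    z_start = np.asarray(z_start, dtype=float)
    z_end = np.asarray(z_end, dtype=float)
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1.0 - t) * z_start + t * z_end

# Hypothetical latent codes of an early-stage and a late-stage sample
z_early = [0.0, 0.0]
z_late = [2.0, 4.0]
path = interpolate_latent(z_early, z_late, n_steps=5)
```

Passing each row of `path` through a trained decoder would yield one synthetic expression profile per step, which together form a trajectory.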

📊 Stage Classification¶

XGBoost-based classification of early vs. late stage cancer
SHAP values for feature importance and interpretability
Cross-validation and performance evaluation
Gene signature discovery

🧬 Enrichment Analysis¶

Integration with DESeq2 for differential expression
GSEA (Gene Set Enrichment Analysis) for pathway analysis
Support for Reactome, KEGG, and custom pathway databases
Parallel processing for large-scale analyses

📈 Visualization¶

Comprehensive plotting functions for all analysis steps
UMAP/t-SNE visualizations of latent space
Pathway enrichment heatmaps
Classification performance metrics
Interactive plots with Plotly

Quick Start¶

Installation¶

# Clone repository
git clone https://github.com/gprolcastelo/renalprog.git
cd renalprog

# Create environment (includes Python and R)
mamba env create -f environment.yml
mamba activate renalprog

# Install package
pip install -e .

See Installation Guide for detailed instructions.

Basic Usage¶

from renalprog import config, dataset, modeling

# Load and process data
data = dataset.load_data('data/processed/rnaseq_maha.csv')
processed = dataset.preprocess(data)

# Train VAE
vae = modeling.VAE(input_dim=processed.shape[1])
vae.train(processed, epochs=100)

# Generate trajectories
trajectories = modeling.generate_trajectories(
    vae=vae,
    start_stage='early',
    end_stage='late',
    n_samples=100
)

# Run enrichment analysis
results = modeling.enrichment_analysis(trajectories)

See Quick Start Tutorial for a complete example.

Pipeline Overview¶

The renalprog pipeline consists of six main steps:

graph LR
    A[Raw Data] --> B[1. Data Processing]
    B --> C[2. VAE Training]
    C --> D[3. Reconstruction Validation]
    D --> E[4. Trajectory Generation]
    E --> F[5. Classification]
    F --> G[6. Enrichment Analysis]
    G --> H[Results & Visualization]

Step 1: Data Processing¶

Filter low-expression genes
Remove outliers using Mahalanobis distance
Normalize expression values
Prepare clinical metadata
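
Outlier removal with the Mahalanobis distance can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the function name is hypothetical, the mean and covariance are estimated from the data itself, and the cutoff is supplied by the caller. A chi-square quantile with as many degrees of freedom as features is a common choice of cutoff.

```python
import numpy as np

def mahalanobis_outliers(X, cutoff):
    """Return squared Mahalanobis distances and a boolean outlier mask.

    X: array of shape (n_samples, n_features).
    cutoff: threshold on the squared distance (an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # The pseudo-inverse keeps the distance defined when the covariance is
    # singular, e.g. when there are more features than samples.
    inv_cov = np.linalg.pinv(cov)
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    return d2, d2 > cutoff

rng = np.random.default_rng(0)
# 50 well-behaved samples plus one obvious outlier appended as the last row
X = np.vstack([rng.standard_normal((50, 2)), [[10.0, 10.0]]])
# 7.378 is the 0.975 quantile of a chi-square distribution with 2 degrees of freedom
d2, is_outlier = mahalanobis_outliers(X, cutoff=7.378)
X_clean = X[~is_outlier]
```

With thousands of genes and only hundreds of samples the covariance matrix is rank-deficient, so in practice the distance is often computed on a reduced representation rather than on raw expression values.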

Step 2: VAE Training¶

Train deep generative models on gene expression data
Learn low-dimensional latent representations
Validate reconstruction quality

Step 3: Reconstruction Validation¶

Assess VAE reconstruction accuracy
Visualize latent space structure
Identify potential issues
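
Two simple ways to quantify reconstruction accuracy are the per-sample mean squared error and the per-gene Pearson correlation between input and reconstruction. The sketch below assumes these two metrics and a hypothetical function name. The metrics the package actually reports may differ.

```python
import numpy as np

def reconstruction_report(x, x_recon):
    """Return (mse_per_sample, pearson_r_per_gene) for two matrices
    of shape (n_samples, n_genes)."""
    x = np.asarray(x, dtype=float)
    x_recon = np.asarray(x_recon, dtype=float)
    mse_per_sample = np.mean((x - x_recon) ** 2, axis=1)
    # Pearson correlation of each gene across samples
    xc = x - x.mean(axis=0)
    rc = x_recon - x_recon.mean(axis=0)
    denom = np.sqrt((xc**2).sum(axis=0) * (rc**2).sum(axis=0))
    with np.errstate(invalid="ignore", divide="ignore"):
        # Genes with zero variance get NaN instead of a correlation
        r_per_gene = np.where(denom > 0, (xc * rc).sum(axis=0) / denom, np.nan)
    return mse_per_sample, r_per_gene

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]])
x_recon = x + 0.5  # a reconstruction with a constant offset
mse, r = reconstruction_report(x, x_recon)
```

The example illustrates why both metrics are useful: a constant offset leaves the correlation at 1 while the mean squared error is non-zero.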

Step 4: Trajectory Generation¶

Generate synthetic patient trajectories
Interpolate between cancer stages
Export trajectory gene expression

Step 5: Classification¶

Train XGBoost classifier for stage prediction
Calculate SHAP values for interpretability
Identify important gene signatures

Step 6: Enrichment Analysis¶

Differential expression analysis with DESeq2
Pathway enrichment with GSEA
Identify biological processes along trajectories
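
To make the GSEA step more concrete, the sketch below computes the running-sum enrichment score for one gene set over a ranked gene list. It follows the standard weighted Kolmogorov-Smirnov-like statistic and omits the permutation-based significance testing and normalization that a full GSEA run performs. The function name and toy data are assumptions, and the pipeline itself relies on dedicated enrichment tooling rather than this code.

```python
import numpy as np

def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Return (ES, running_sum) for a gene list sorted by decreasing score."""
    in_set = np.array([g in gene_set for g in ranked_genes])
    if not in_set.any() or in_set.all():
        raise ValueError("gene_set must contain some, but not all, ranked genes")
    weights = np.abs(np.asarray(scores, dtype=float)) ** weight
    # Hits step up in proportion to their score, misses step down uniformly
    hit = np.where(in_set, weights, 0.0)
    hit = hit / hit.sum()
    miss = np.where(in_set, 0.0, 1.0) / (~in_set).sum()
    running = np.cumsum(hit - miss)
    # ES is the running-sum value with the largest absolute deviation from zero
    return running[np.argmax(np.abs(running))], running

# Toy ranking: genes sorted by a differential expression statistic
ranked_genes = ["A", "B", "C", "D", "E", "F"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es, running = enrichment_score(ranked_genes, scores, gene_set={"A", "B"})
```

Here both genes in the set sit at the top of the ranking, so the running sum peaks at 1.0 after the second gene and then decays back to zero.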

Documentation Structure¶

📚 For New Users¶

Start with:

Installation Guide - Set up your environment
Quick Start Tutorial - Run your first analysis
Complete Pipeline Tutorial - End-to-end workflow

🔬 For Reproducibility¶

Reproduce published results:

System Requirements - Hardware and software needs
Data Preparation - Download and prepare data
Running the Pipeline - Step-by-step execution
Expected Results - Validate your outputs

🛠️ For Developers¶

Extend and customize:

API Reference - Complete function documentation
Architecture Guide - Design principles
Custom Models - Implement new architectures
Contributing Guidelines - Join development

System Requirements¶

Minimum Requirements¶

OS: Linux, macOS, or Windows (with WSL for enrichment analysis)
Python: 3.9 or higher
R: 4.0 or higher (for enrichment analysis)
RAM: 8 GB
Storage: 20 GB free space

Recommended Requirements¶

RAM: 16+ GB
CPU: 8+ cores
GPU: CUDA-capable GPU with 6+ GB VRAM (for VAE training)
Storage: 50+ GB on SSD

Software Dependencies¶

PyTorch 2.0+
scikit-learn 1.0+
XGBoost 1.5+
pandas, numpy, scipy
R packages: DESeq2, gprofiler2

See System Requirements for complete details.
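
A quick way to check an environment against the requirements above is a small standard-library script. The sketch below is an assumption-laden helper, not part of renalprog: it checks only the Python version, whether `Rscript` is on the `PATH`, and whether the listed Python modules can be found. It does not verify package versions or R packages.

```python
import importlib.util
import shutil
import sys

# Import names of the Python dependencies listed above
REQUIRED_MODULES = ["torch", "sklearn", "xgboost", "pandas", "numpy", "scipy"]

def check_environment(min_python=(3, 9)):
    """Return a small report on the current environment."""
    return {
        "python_ok": sys.version_info[:2] >= min_python,
        "rscript_found": shutil.which("Rscript") is not None,
        "missing_modules": [
            name for name in REQUIRED_MODULES
            if importlib.util.find_spec(name) is None
        ],
    }

report = check_environment()
for key, value in report.items():
    print(f"{key}: {value}")
```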

Use Cases¶

Cancer Research¶

Model tumor evolution over time
Identify driver pathways in progression
Predict patient outcomes
Discover therapeutic targets

Computational Biology¶

Learn representations of high-dimensional genomics data
Generate synthetic data for validation
Integrate multi-omics datasets
Perform pathway-level analysis

Machine Learning Research¶

Apply VAEs to biological data
Develop interpretable deep learning models
Benchmark generative models
Study latent space interpolation

Citation¶

If you use renalprog in your research, please cite:

@software{renalprog2025,
  author = {Prol-Castelo, Guillermo and {EVENFLOW Project}},
  title = {renalprog: Simulating Kidney Cancer Progression with Generative AI},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/gprolcastelo/renalprog}
}

See How to Cite for additional references.

License¶

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Support¶

Issues: GitHub Issues
Discussions: GitHub Discussions
Email: Contact the EVENFLOW Project team

Acknowledgments¶

This work is supported by the EVENFLOW Project and builds upon numerous open-source tools and databases. See Acknowledgments for complete credits.