renalprog¶
A Python package for simulating kidney cancer progression with synthetic data generation and machine learning
Logo: Kidneys icons created by Smashicons - Flaticon
Warning
This site is under construction. Some pages may be missing.
Overview¶
renalprog is a bioinformatics pipeline for analyzing the progression of kidney renal clear cell carcinoma (KIRC) using deep learning and pathway enrichment analysis. The package combines Variational Autoencoders (VAEs) with differential expression analysis and gene set enrichment to model and predict cancer progression trajectories.
Scientific Context¶
Cancer progression is a complex, dynamic process in which molecular alterations accumulate over time. Traditional static analyses do not capture the temporal dynamics of tumor evolution. renalprog addresses this challenge by:
- Learning latent representations of gene expression data using deep generative models (VAEs)
- Generating synthetic trajectories between cancer stages in latent space
- Identifying enriched biological pathways along progression trajectories
- Classifying cancer stages using interpretable machine learning
This approach enables researchers to:
- Identify key biological pathways driving cancer progression
- Predict patient outcomes based on molecular profiles
- Generate testable hypotheses about therapeutic targets
- Understand the temporal dynamics of tumor evolution
Key Features¶
🔬 Data Processing¶
- Automated filtering of low-expression genes
- Robust outlier detection using Mahalanobis distance
- Normalization and batch effect correction
- Integration with TCGA and other genomics datasets
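As a rough illustration of the first bullet, the sketch below drops genes that reach a minimum expression value in too few samples. The function name `filter_low_expression`, both thresholds, and the gene-to-values dictionary layout are assumptions made for this example; they are not renalprog's actual API or defaults.

```python
from typing import Dict, List


def filter_low_expression(
    expression: Dict[str, List[float]],
    min_value: float = 1.0,
    min_fraction: float = 0.5,
) -> Dict[str, List[float]]:
    """Keep genes expressed at or above `min_value` in at least
    `min_fraction` of samples. Both thresholds are illustrative."""
    kept = {}
    for gene, values in expression.items():
        if not values:
            continue  # skip genes with no measurements
        n_expressed = sum(1 for v in values if v >= min_value)
        if n_expressed / len(values) >= min_fraction:
            kept[gene] = values
    return kept


# Toy matrix with made-up numbers: gene -> expression across five samples.
toy = {
    "GENE_A": [0.0, 0.1, 0.0, 0.2, 0.0],  # never reaches min_value
    "GENE_B": [5.2, 0.0, 3.1, 0.4, 7.9],  # expressed in 3 of 5 samples
    "GENE_C": [0.0, 0.0, 0.0, 0.0, 1.5],  # expressed in 1 of 5 samples
}
filtered = filter_low_expression(toy)
print(sorted(filtered))
```

In practice the cutoffs depend on the normalization applied to the counts, so they should be chosen per dataset rather than copied from this sketch.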
🧠 Deep Learning Models¶
- Variational Autoencoder (VAE) for unsupervised representation learning
- Conditional VAE (CVAE) for stage-specific modeling
- Support for custom architectures and hyperparameters
- GPU acceleration for large-scale datasets
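For readers new to VAEs, the quantity usually minimized during training is the negative evidence lower bound. The form below assumes a diagonal Gaussian encoder and a standard normal prior, which is the textbook setup; this page does not state which loss renalprog uses. The symbols ($x$ for an expression profile, $z$ for its latent code, $\mu_j$ and $\sigma_j^2$ for the encoder's mean and variance in latent dimension $j$, $d$ for the latent dimension) are introduced here for illustration.

```latex
\mathcal{L}(\theta, \phi; x)
  = -\,\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  + D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,\mathcal{N}(0, I)\right),
\qquad
D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,\mathcal{N}(0, I)\right)
  = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)
```

The first term rewards accurate reconstruction and the second keeps the latent codes close to the prior. In a conditional VAE, both the encoder and the decoder additionally receive a condition such as the cancer stage.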
🔄 Trajectory Generation¶
- Generate synthetic patient trajectories between cancer stages
- Interpolation in latent space with biological constraints
- Multiple trajectory types (early-to-late, stage-specific, custom)
- Quality control and validation metrics
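The simplest form of latent-space interpolation is a straight line between two latent codes. The sketch below shows only that basic idea; the function name `interpolate_latent` and the two-dimensional toy codes are hypothetical, and the biological constraints mentioned above are not modeled here.

```python
from typing import List


def interpolate_latent(
    start: List[float], end: List[float], n_steps: int
) -> List[List[float]]:
    """Return `n_steps` evenly spaced points on the straight line from
    `start` to `end`, endpoints included."""
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    if len(start) != len(end):
        raise ValueError("start and end must have the same dimension")
    path = []
    for i in range(n_steps):
        t = i / (n_steps - 1)  # interpolation weight, from 0.0 to 1.0
        path.append([(1 - t) * s + t * e for s, e in zip(start, end)])
    return path


# Hypothetical 2-D latent codes for an early-stage and a late-stage sample.
z_early = [0.0, 2.0]
z_late = [4.0, 0.0]
trajectory = interpolate_latent(z_early, z_late, n_steps=5)
print(trajectory)
```

Each intermediate point would then be passed through the trained decoder to obtain a synthetic gene expression profile for that position along the trajectory.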
📊 Stage Classification¶
- XGBoost-based classification of early vs. late stage cancer
- SHAP values for feature importance and interpretability
- Cross-validation and performance evaluation
- Gene signature discovery
🧬 Enrichment Analysis¶
- Integration with DESeq2 for differential expression
- GSEA (Gene Set Enrichment Analysis) for pathway analysis
- Support for Reactome, KEGG, and custom pathway databases
- Parallel processing for large-scale analyses
📈 Visualization¶
- Comprehensive plotting functions for all analysis steps
- UMAP/t-SNE visualizations of latent space
- Pathway enrichment heatmaps
- Classification performance metrics
- Interactive plots with Plotly
Quick Start¶
Installation¶
# Clone repository
git clone https://github.com/gprolcastelo/renalprog.git
cd renalprog
# Create environment (includes Python and R)
mamba env create -f environment.yml
mamba activate renalprog
# Install package
pip install -e .
See Installation Guide for detailed instructions.
Basic Usage¶
from renalprog import config, dataset, modeling
# Load and process data
data = dataset.load_data('data/processed/rnaseq_maha.csv')
processed = dataset.preprocess(data)
# Train VAE
vae = modeling.VAE(input_dim=processed.shape[1])
vae.train(processed, epochs=100)
# Generate trajectories
trajectories = modeling.generate_trajectories(
    vae=vae,
    start_stage='early',
    end_stage='late',
    n_samples=100
)
# Run enrichment analysis
results = modeling.enrichment_analysis(trajectories)
See Quick Start Tutorial for a complete example.
Pipeline Overview¶
The renalprog pipeline consists of six main steps:
graph LR
    A[Raw Data] --> B[1. Data Processing]
    B --> C[2. VAE Training]
    C --> D[3. Reconstruction Check]
    D --> E[4. Trajectory Generation]
    E --> F[5. Classification]
    F --> G[6. Enrichment Analysis]
    G --> H[Results & Visualization]
Step 1: Data Processing¶
- Filter low-expression genes
- Remove outliers using Mahalanobis distance
- Normalize expression values
- Prepare clinical metadata
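For reference, the Mahalanobis distance used in the outlier-removal step measures how far a sample $x$ lies from the data mean $\mu$ in units that account for the covariance $\Sigma$ between features. The cutoff shown is a common convention that assumes roughly multivariate normal data with $p$ features and a chosen significance level $\alpha$; the page does not say which threshold renalprog applies.

```latex
D_M(x) = \sqrt{(x - \mu)^{\top}\, \Sigma^{-1}\, (x - \mu)},
\qquad
x \text{ is flagged as an outlier if } D_M^2(x) > \chi^2_{p,\,1-\alpha}
```

When there are more features than samples, the sample covariance matrix is singular, so the distance is typically computed on a reduced representation or with a regularized covariance estimate.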
Step 2: VAE Training¶
- Train deep generative models on gene expression data
- Learn low-dimensional latent representations
- Validate reconstruction quality
Step 3: Reconstruction Validation¶
- Assess VAE reconstruction accuracy
- Visualize latent space structure
- Identify potential issues
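One simple way to assess reconstruction accuracy is the mean squared error between each sample and its reconstruction. The sketch below is an assumption about what such a check could look like, with a hypothetical helper `per_sample_mse` and made-up values; renalprog may report different metrics.

```python
from typing import List


def per_sample_mse(
    original: List[List[float]], reconstructed: List[List[float]]
) -> List[float]:
    """Mean squared error between each sample and its reconstruction."""
    if len(original) != len(reconstructed):
        raise ValueError("sample counts differ")
    errors = []
    for x_row, x_hat_row in zip(original, reconstructed):
        if not x_row or len(x_row) != len(x_hat_row):
            raise ValueError("feature counts differ or a sample is empty")
        squared = sum((a - b) ** 2 for a, b in zip(x_row, x_hat_row))
        errors.append(squared / len(x_row))
    return errors


# Made-up values: two samples with three genes each.
x = [[1.0, 2.0, 3.0], [0.0, 0.0, 4.0]]
x_hat = [[1.0, 2.0, 5.0], [0.0, 0.0, 4.0]]
errors = per_sample_mse(x, x_hat)
print(errors)
```

Samples with unusually high error are candidates for closer inspection, since the model represents them poorly.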
Step 4: Trajectory Generation¶
- Generate synthetic patient trajectories
- Interpolate between cancer stages
- Export trajectory gene expression
Step 5: Classification¶
- Train XGBoost classifier for stage prediction
- Calculate SHAP values for interpretability
- Identify important gene signatures
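Whichever classifier is used, early-versus-late predictions are commonly summarized by accuracy, sensitivity, and specificity. The following is a minimal sketch under the assumption that "late" is treated as the positive class; the function name `binary_metrics` and the labels are illustrative and not taken from the package.

```python
from typing import Dict, List


def binary_metrics(
    y_true: List[str], y_pred: List[str], positive: str = "late"
) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity for a two-class problem."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Undefined ratios are reported as NaN instead of raising.
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Made-up labels for six held-out samples.
y_true = ["early", "early", "late", "late", "late", "early"]
y_pred = ["early", "late", "late", "late", "early", "early"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Under cross-validation, these metrics would be computed on each held-out fold and then averaged.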
Step 6: Enrichment Analysis¶
- Differential expression analysis with DESeq2
- Pathway enrichment with GSEA
- Identify biological processes along trajectories
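To give an intuition for what GSEA computes, the sketch below implements the unweighted running-sum enrichment score: walking down a ranked gene list, the score rises at each pathway member and falls otherwise, and the largest deviation from zero is reported. This is a simplified variant for illustration only. Standard GSEA tools usually weight hits by the ranking metric and assess significance by permutation, neither of which is shown, and the name `enrichment_score` is hypothetical.

```python
from typing import List, Set


def enrichment_score(ranked_genes: List[str], gene_set: Set[str]) -> float:
    """Unweighted running-sum enrichment score (Kolmogorov-Smirnov style)."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up on a pathway gene, step down on any other gene.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Made-up ranking (most up-regulated first) and a toy pathway.
ranked = ["G1", "G2", "G3", "G4", "G5", "G6"]
pathway = {"G1", "G2", "G5"}
es = enrichment_score(ranked, pathway)
print(round(es, 3))
```

A positive score indicates that pathway genes cluster near the top of the ranking, a negative score that they cluster near the bottom.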
Documentation Structure¶
📚 For New Users¶
Start with:
- Installation Guide - Set up your environment
- Quick Start Tutorial - Run your first analysis
- Complete Pipeline Tutorial - End-to-end workflow
🔬 For Reproducibility¶
Reproduce published results:
- System Requirements - Hardware and software needs
- Data Preparation - Download and prepare data
- Running the Pipeline - Step-by-step execution
- Expected Results - Validate your outputs
🛠️ For Developers¶
Extend and customize:
- API Reference - Complete function documentation
- Architecture Guide - Design principles
- Custom Models - Implement new architectures
- Contributing Guidelines - Join development
System Requirements¶
Minimum Requirements¶
- OS: Linux, macOS, or Windows (with WSL for enrichment analysis)
- Python: 3.9 or higher
- R: 4.0 or higher (for enrichment analysis)
- RAM: 8 GB
- Storage: 20 GB free space
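A quick way to check the Python and R entries above before installing is sketched below. It only tests the interpreter version and whether an `Rscript` executable is on the `PATH`; it does not verify the R version, RAM, or disk space. The helper name `check_environment` is an assumption for this example.

```python
import shutil
import sys
from typing import Dict, Tuple


def check_environment(min_python: Tuple[int, int] = (3, 9)) -> Dict[str, bool]:
    """Report whether Python is recent enough and Rscript is on PATH."""
    return {
        "python_ok": sys.version_info >= min_python,
        "rscript_found": shutil.which("Rscript") is not None,
    }


status = check_environment()
print(status)
```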
Recommended Requirements¶
- RAM: 16+ GB
- CPU: 8+ cores
- GPU: CUDA-capable GPU with 6+ GB VRAM (for VAE training)
- Storage: 50+ GB on SSD
Software Dependencies¶
- PyTorch 2.0+
- scikit-learn 1.0+
- XGBoost 1.5+
- pandas, numpy, scipy
- R packages: DESeq2, gprofiler2
See System Requirements for complete details.
Use Cases¶
Cancer Research¶
- Model tumor evolution over time
- Identify driver pathways in progression
- Predict patient outcomes
- Discover therapeutic targets
Computational Biology¶
- Learn representations of high-dimensional genomics data
- Generate synthetic data for validation
- Integrate multi-omics datasets
- Perform pathway-level analysis
Machine Learning Research¶
- Apply VAEs to biological data
- Develop interpretable deep learning models
- Benchmark generative models
- Study latent space interpolation
Citation¶
If you use renalprog in your research, please cite:
@software{renalprog2025,
  author = {Prol-Castelo, Guillermo and {EVENFLOW Project}},
  title = {renalprog: Simulating Kidney Cancer Progression with Generative AI},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/gprolcastelo/renalprog}
}
See How to Cite for additional references.
License¶
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Support¶
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: Contact the EVENFLOW Project team
Acknowledgments¶
This work is supported by the EVENFLOW Project and builds upon numerous open-source tools and databases. See Acknowledgments for complete credits.