API Reference

Complete API reference for the renalprog package. All functions, classes, and modules are documented here.

Package Structure

renalprog/
├── __init__.py              # Package initialization
├── config.py                # Configuration and paths
├── dataset.py               # Data loading and splitting
├── features.py              # Preprocessing and feature engineering
├── plots.py                 # Visualization functions
├── modeling/                # Machine learning models
│   ├── __init__.py
│   ├── models.py           # VAE architectures
│   ├── train.py            # Training functions
│   ├── predict.py          # Prediction and inference
│   ├── trajectories.py     # Trajectory generation
│   ├── classification.py   # XGBoost classification
│   └── enrichment.py       # GSEA integration
└── utils/                   # Utility functions
    ├── __init__.py
    ├── logging.py          # Logging configuration
    └── helpers.py          # Helper functions

Quick Links

Core Modules

Configuration - Paths and settings
Dataset - Data loading and processing
Features - Preprocessing and filtering

Modeling

VAE Models - Variational autoencoders
Training - Model training
Prediction - Inference and generation
Trajectories - Synthetic progression
Classification - Stage prediction
Enrichment - Pathway analysis

Visualization

Plots - All plotting functions

Utilities

Utils - Helper functions and logging

Installation

pip install renalprog

Or, for development, install from source in editable mode:

git clone https://github.com/gprolcastelo/renalprog.git
cd renalprog
pip install -e .

Basic Usage

Import the Package

import renalprog
from renalprog import config, dataset, features, modeling, plots

Load Data

# Load preprocessed data
rnaseq = dataset.load_data('data/processed/rnaseq_maha.csv')
clinical = dataset.load_data('data/processed/clinical.csv')

# Create train/test split
X_train, X_test, y_train, y_test = dataset.create_train_test_split(
    rnaseq_path='data/processed/rnaseq_maha.csv',
    clinical_path='data/processed/clinical.csv',
    test_size=0.2,
    seed=2023
)

Train a Model

from renalprog.modeling import VAE, train_vae

# Initialize VAE
vae = VAE(
    input_dim=X_train.shape[1],
    latent_dim=256,
    hidden_dims=[512, 256]
)

# Train
vae.fit(X_train, epochs=100, batch_size=32)

# Save
vae.save('models/my_vae.pt')

Generate Trajectories

from renalprog.modeling import generate_trajectories

trajectories = generate_trajectories(
    vae=vae,
    start_samples=early_stage_samples,
    end_samples=late_stage_samples,
    n_trajectories=100,
    n_timepoints=20
)

Run Classification

from renalprog.modeling.classification import train_stage_classifier

clf, metrics, shap_values = train_stage_classifier(
    X=X_train,
    y=y_train,
    feature_names=gene_names,
    n_folds=5
)

Visualize

from renalprog import plots

# Plot latent space
plots.plot_latent_space(
    latent_data,
    color_by='stage',
    method='umap',
    save_path='figures/latent_umap.png'
)

# Plot training history
plots.plot_training_history(
    history,
    save_path='figures/training.png'
)

Module Details

renalprog.config

Configuration module with paths and constants.

Key Components:

PATHS: Dictionary of all project paths
VAEConfig: VAE hyperparameter configuration
get_dated_dir(): Create dated output directories
KIRCPaths: KIRC-specific data paths

Example:

from renalprog.config import PATHS, VAEConfig

# Access paths
data_dir = PATHS['data']
models_dir = PATHS['models']

# Configure VAE (named vae_config so it does not shadow the renalprog.config module)
vae_config = VAEConfig()
vae_config.LATENT_DIM = 256
vae_config.EPOCHS = 600

See: Configuration Reference
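
As a rough illustration of what a dated-directory helper such as get_dated_dir() might do, the sketch below creates a subdirectory named after a date. The function name make_dated_dir, its parameters, and the YYYYMMDD naming scheme are assumptions made for this example; the package's actual signature and directory layout may differ.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def make_dated_dir(base: str, prefix: str = "", today: Optional[date] = None) -> Path:
    """Create and return base/<YYYYMMDD>[_prefix].

    Hypothetical helper for illustration only; not renalprog's API.
    """
    stamp = (today or date.today()).strftime("%Y%m%d")
    name = f"{stamp}_{prefix}" if prefix else stamp
    out_dir = Path(base) / name
    out_dir.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
    return out_dir
```

Passing the date explicitly (the `today` argument) keeps the helper easy to test and lets a rerun write into an earlier run's directory.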

renalprog.dataset

Data loading, processing, and splitting utilities.

Key Functions:

download_data(): Download TCGA data from Xena
process_downloaded_data(): Process raw TCGA files
load_data(): Load CSV files
create_train_test_split(): Stratified train/test split

Example:

from renalprog import dataset

# Download data
rnaseq, clinical, pheno = dataset.download_data(
    destination='data/raw',
    cancer_type='KIRC'
)

# Create split
X_train, X_test, y_train, y_test = dataset.create_train_test_split(
    rnaseq_path='data/processed/rnaseq.csv',
    clinical_path='data/processed/clinical.csv',
    test_size=0.2,
    seed=2023
)

See: Dataset Reference
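
create_train_test_split() is described as a stratified split. For readers unfamiliar with the idea, here is a minimal standard-library sketch of stratification: indices are grouped by label, and the test fraction is drawn from each group separately. The function stratified_split_indices is hypothetical and is not how renalprog necessarily implements its split.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_split_indices(
    labels: Sequence[Hashable], test_size: float = 0.2, seed: int = 2023
) -> Tuple[List[int], List[int]]:
    """Split row indices so each label keeps roughly the same share in train and test."""
    rng = random.Random(seed)  # local RNG: the same seed gives the same split
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_label, key=str):  # fixed label order keeps the result deterministic
        members = by_label[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)
```

With 80 early-stage and 20 late-stage samples and test_size=0.2, the test set receives 16 early and 4 late samples, so the stage ratio is the same on both sides of the split.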

renalprog.features

Preprocessing and feature engineering.

Key Functions:

preprocess_rnaseq(): Filter and normalize gene expression
filter_low_expression(): Remove lowly expressed genes
detect_outliers(): Mahalanobis distance outlier detection
normalize(): Various normalization methods

Example:

from renalprog import features

# Preprocess RNA-seq data
processed, info = features.preprocess_rnaseq(
    data=rnaseq,
    filter_expression=True,
    detect_outliers=True,
    mean_threshold=0.5,
    var_threshold=0.5,
    alpha=0.05
)

See: Features Reference
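
detect_outliers() is listed as Mahalanobis distance outlier detection, and preprocess_rnaseq() accepts an alpha argument. The implementation is not shown in this reference, so the following is only a sketch of the general technique, assuming alpha is the tail probability of a chi-squared cutoff on the squared distance. The function names are hypothetical, and the cutoff uses the Wilson-Hilferty approximation so that the example needs only NumPy and the standard library.

```python
from statistics import NormalDist

import numpy as np


def chi2_quantile_approx(prob: float, dof: int) -> float:
    """Wilson-Hilferty approximation to the chi-squared quantile."""
    z = NormalDist().inv_cdf(prob)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3


def mahalanobis_outliers(x: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    """Return a boolean mask of rows whose squared Mahalanobis distance
    from the sample mean exceeds the (1 - alpha) chi-squared cutoff."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    cov = np.atleast_2d(np.cov(x, rowvar=False))
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    # Row-wise quadratic form: d2[i] = centered[i] @ cov_inv @ centered[i]
    d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    return d2 > chi2_quantile_approx(1.0 - alpha, dof=x.shape[1])
```

Rows are samples and columns are features. When there are far more features than samples, the sample covariance is singular, so in practice this kind of check is usually run on a low-dimensional representation of the data rather than on all genes at once.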

renalprog.modeling

Machine learning models and training.

Submodules:

models: VAE architectures
train: Training functions
predict: Inference and generation
trajectories: Synthetic trajectory generation
classification: XGBoost stage classification
enrichment: GSEA integration

Example:

from renalprog.modeling import VAE, train_vae, generate_trajectories

# Train VAE
vae, history = train_vae(
    X_train=X_train,
    X_test=X_test,
    config=vae_config
)

# Generate trajectories
trajectories = generate_trajectories(
    vae=vae,
    start_samples=early_samples,
    end_samples=late_samples,
    n_trajectories=100
)

See:

Models Reference
Training Reference
Trajectories Reference
Classification Reference
Enrichment Reference

renalprog.plots

Visualization functions for all analysis steps.

Key Functions:

plot_latent_space(): UMAP/t-SNE visualization
plot_training_history(): Loss curves
plot_reconstruction(): Original vs reconstructed
plot_trajectories(): Trajectory visualization
plot_classification_metrics(): ROC, confusion matrix
plot_enrichment_heatmap(): Pathway enrichment

Example:

from renalprog import plots

# Visualize latent space
plots.plot_latent_space(
    latent_data,
    color_by='stage',
    method='umap',
    save_path='figures/latent.png',
    figsize=(10, 8),
    title='VAE Latent Space'
)

See: Plots Reference

renalprog.utils

Utility functions and helpers.

Key Functions:

configure_logging(): Set up logging
set_seed(): Set random seeds for reproducibility
Timer: Context manager for timing code
check_file_exists(): File validation

Example:

from renalprog.utils import configure_logging, set_seed, Timer

# Configure logging
configure_logging(level='INFO')

# Set random seed
set_seed(2023)

# Time code execution
with Timer("Training"):
    vae.fit(X_train, epochs=100)

See: Utils Reference
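
To make the two helpers listed here more concrete, the sketch below shows one way a timing context manager and a seeding function can be written. The names timer and seed_everything are hypothetical stand-ins, not the package's Timer and set_seed(), whose behavior (for example, logging instead of storing the elapsed time) may differ.

```python
import random
import time
from contextlib import contextmanager


@contextmanager
def timer(label: str, results: dict):
    """Store the wall-clock seconds spent inside the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed steps are still timed
        results[label] = time.perf_counter() - start


def seed_everything(seed: int) -> None:
    """Seed Python's RNG; NumPy and PyTorch are seeded only if they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass
```

Seeding every random number generator in use is what makes a run repeatable; seeding only one of them leaves the others free to vary between runs.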

Advanced Topics

Custom VAE Architectures

from renalprog.modeling.models import BaseVAE
import torch.nn as nn

class CustomVAE(BaseVAE):
    def __init__(self, input_dim, latent_dim):
        super().__init__(input_dim, latent_dim)

        # Custom encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(1024, 512),
            nn.ReLU(),
        )

        self.fc_mu = nn.Linear(512, latent_dim)
        self.fc_logvar = nn.Linear(512, latent_dim)

        # Custom decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 1024),
            nn.ReLU(),
            nn.Linear(1024, input_dim)
        )

Custom Preprocessing

import numpy as np

from renalprog.features import preprocess_rnaseq

def custom_preprocess(data, **kwargs):
    """Custom preprocessing pipeline."""

    # Step 1: Log transform
    data_log = np.log2(data + 1)

    # Step 2: Quantile normalization
    # quantile_normalize is not defined in this reference; supply your own implementation
    data_norm = quantile_normalize(data_log)

    # Step 3: Standard preprocessing
    data_processed, info = preprocess_rnaseq(
        data_norm,
        **kwargs
    )

    return data_processed, info
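
The quantile_normalize call in the pipeline is left undefined in this reference. A minimal NumPy sketch of quantile normalization is given here as one possible stand-in; it is an assumption, not part of renalprog. It treats each column as one sample (transpose first if your layout is samples by genes) and breaks ties by position rather than averaging them.

```python
import numpy as np


def quantile_normalize(values) -> np.ndarray:
    """Force every column to share the same distribution of values.

    The k-th smallest entry of each column is replaced by the mean of the
    k-th smallest entries across all columns.
    """
    x = np.asarray(values, dtype=float)
    order = np.argsort(x, axis=0, kind="stable")      # sort order within each column
    ranks = np.argsort(order, axis=0, kind="stable")  # rank of each entry within its column
    reference = np.sort(x, axis=0).mean(axis=1)       # mean value at each rank
    return reference[ranks]
```

The function returns a plain array, so when the input is a pandas DataFrame the result needs to be wrapped back into a DataFrame with the original index and columns before it is passed to preprocess_rnaseq().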

Batch Processing

import glob

from renalprog.modeling import VAE, generate_trajectories

# Process multiple experiments (the early and late sample sets must already be loaded)
for experiment_dir in glob.glob('data/interim/experiment_*'):
    vae = VAE.load(f'{experiment_dir}/vae_model.pt')

    trajectories = generate_trajectories(
        vae=vae,
        start_samples=early,
        end_samples=late,
        n_trajectories=100
    )

    trajectories.to_csv(f'{experiment_dir}/trajectories.csv')

Type Hints

The package uses type hints throughout:

from typing import Any, Tuple, Optional, Dict, List
import pandas as pd
import numpy as np
import torch

def preprocess_rnaseq(
    data: pd.DataFrame,
    filter_expression: bool = True,
    mean_threshold: float = 0.5,
    var_threshold: float = 0.5,
    detect_outliers: bool = True,
    alpha: float = 0.05,
    seed: Optional[int] = None
) -> Tuple[pd.DataFrame, Dict[str, Any]]:
    """Preprocess RNA-seq data."""
    pass

Error Handling

Common exceptions:

import pandas as pd

from renalprog.dataset import load_data

try:
    data = load_data('nonexistent_file.csv')
except FileNotFoundError as e:
    print(f"File not found: {e}")
except pd.errors.EmptyDataError:
    print("File is empty")
except Exception as e:
    print(f"Unexpected error: {e}")

Performance

GPU Acceleration

import torch

from renalprog.modeling import VAE

# Use GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

vae = VAE(
    input_dim=5000,
    latent_dim=256,
    device=device
)

Parallel Processing

from renalprog.modeling.enrichment import run_gsea_parallel

# Use multiple cores
run_gsea_parallel(
    deseq_dir='data/processed/deseq',
    n_threads=8
)
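
How run_gsea_parallel() distributes its work is not documented here. As a generic illustration of the pattern its n_threads argument suggests, the sketch below applies a worker function to every file in a directory using a thread pool. The helper run_per_file, its parameters, and the "*.csv" pattern are hypothetical and say nothing about the package's internals.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict


def run_per_file(
    input_dir: str,
    worker: Callable[[Path], object],
    n_threads: int = 8,
    pattern: str = "*.csv",
) -> Dict[str, object]:
    """Apply worker to every matching file using a thread pool; return {filename: result}."""
    files = sorted(Path(input_dir).glob(pattern))
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(worker, files))  # map preserves input order
    return {f.name: r for f, r in zip(files, results)}
```

Threads are a reasonable fit when each job mostly waits on I/O or on an external process. For CPU-bound Python code, ProcessPoolExecutor from the same module is the usual substitute.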

Testing

Run the test suite:

# All tests
pytest

# Specific module
pytest tests/test_dataset.py

# With coverage
pytest --cov=renalprog --cov-report=html

Contributing

See Contributing Guidelines for development information.

Version History

v0.1.0 (2024-12): Initial release, including:
VAE training pipeline
Trajectory generation
XGBoost classification
GSEA integration

License

Apache License 2.0. See LICENSE.

Citation

@software{renalprog2024,
  author = {Prol-Castelo, Guillermo},
  title = {renalprog: Cancer Progression Forecasting with Generative AI},
  year = {2024},
  url = {https://github.com/gprolcastelo/renalprog}
}