API Reference¶
Complete API reference for the renalprog package. All functions, classes, and modules are documented here.
Package Structure¶
renalprog/
├── __init__.py # Package initialization
├── config.py # Configuration and paths
├── dataset.py # Data loading and splitting
├── features.py # Preprocessing and feature engineering
├── plots.py # Visualization functions
├── modeling/ # Machine learning models
│   ├── __init__.py
│   ├── models.py # VAE architectures
│   ├── train.py # Training functions
│   ├── predict.py # Prediction and inference
│   ├── trajectories.py # Trajectory generation
│   ├── classification.py # XGBoost classification
│   └── enrichment.py # GSEA integration
└── utils/ # Utility functions
    ├── __init__.py
    ├── logging.py # Logging configuration
    └── helpers.py # Helper functions
Quick Links¶
Core Modules¶
- Configuration - Paths and settings
- Dataset - Data loading and processing
- Features - Preprocessing and filtering
Modeling¶
- VAE Models - Variational autoencoders
- Training - Model training
- Prediction - Inference and generation
- Trajectories - Synthetic progression
- Classification - Stage prediction
- Enrichment - Pathway analysis
Visualization¶
- Plots - All plotting functions
Utilities¶
- Utils - Helper functions and logging
Installation¶
Or for development:
Basic Usage¶
Import the Package¶
from renalprog import dataset
Load Data¶
# Load preprocessed data
rnaseq = dataset.load_data('data/processed/rnaseq_maha.csv')
clinical = dataset.load_data('data/processed/clinical.csv')
# Create train/test split
X_train, X_test, y_train, y_test = dataset.create_train_test_split(
    rnaseq_path='data/processed/rnaseq_maha.csv',
    clinical_path='data/processed/clinical.csv',
    test_size=0.2,
    seed=2023
)
Train a Model¶
from renalprog.modeling import VAE, train_vae
# Initialize VAE
vae = VAE(
    input_dim=X_train.shape[1],
    latent_dim=256,
    hidden_dims=[512, 256]
)
# Train
vae.fit(X_train, epochs=100, batch_size=32)
# Save
vae.save('models/my_vae.pt')
Generate Trajectories¶
from renalprog.modeling import generate_trajectories
trajectories = generate_trajectories(
    vae=vae,
    start_samples=early_stage_samples,
    end_samples=late_stage_samples,
    n_trajectories=100,
    n_timepoints=20
)
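This reference does not say how generate_trajectories moves from the start samples to the end samples. One common approach, assumed here purely for illustration, is to encode both endpoints, interpolate linearly between their latent vectors, and decode each intermediate point. The sketch below shows only the interpolation step on plain Python lists; interpolate_latent is a hypothetical name, not part of renalprog.

```python
from typing import List, Sequence


def interpolate_latent(
    z_start: Sequence[float], z_end: Sequence[float], n_timepoints: int = 20
) -> List[List[float]]:
    """Return n_timepoints latent vectors evenly spaced from z_start to z_end.

    Hypothetical helper: linear interpolation is an assumption, not the
    documented behaviour of renalprog.
    """
    if len(z_start) != len(z_end):
        raise ValueError("z_start and z_end must have the same dimension")
    if n_timepoints < 2:
        raise ValueError("n_timepoints must be at least 2")
    path = []
    for step in range(n_timepoints):
        t = step / (n_timepoints - 1)  # 0.0 at the start, 1.0 at the end
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path
```

Under this assumption, each of the n_timepoints vectors would be passed through the decoder to obtain one synthetic expression profile per step.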
Run Classification¶
from renalprog.modeling.classification import train_stage_classifier
clf, metrics, shap_values = train_stage_classifier(
    X=X_train,
    y=y_train,
    feature_names=gene_names,
    n_folds=5
)
Visualize¶
from renalprog import plots
# Plot latent space
plots.plot_latent_space(
    latent_data,
    color_by='stage',
    method='umap',
    save_path='figures/latent_umap.png'
)
# Plot training history
plots.plot_training_history(
    history,
    save_path='figures/training.png'
)
Module Details¶
renalprog.config¶
Configuration module with paths and constants.
Key Components:
- PATHS: Dictionary of all project paths
- VAEConfig: VAE hyperparameter configuration
- get_dated_dir(): Create dated output directories
- KIRCPaths: KIRC-specific data paths
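To make the idea of a dated output directory concrete, here is a minimal standalone sketch. The signature and naming scheme of get_dated_dir() are not given in this reference, so make_dated_dir, its parameters, and the YYYYMMDD prefix are all assumptions.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def make_dated_dir(base: Path, label: str = "", today: Optional[date] = None) -> Path:
    """Create and return base/<YYYYMMDD>[_label].

    Hypothetical stand-in for a dated-directory helper; the real
    get_dated_dir() may use a different signature and naming scheme.
    """
    stamp = (today or date.today()).strftime("%Y%m%d")
    name = f"{stamp}_{label}" if label else stamp
    out_dir = Path(base) / name
    # exist_ok=True makes repeated calls on the same day safe.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```

Passing today explicitly is only there to make the helper easy to test; in normal use it defaults to the current date.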
Example:
from renalprog.config import PATHS, VAEConfig
# Access paths
data_dir = PATHS['data']
models_dir = PATHS['models']
# Configure VAE
config = VAEConfig()
config.LATENT_DIM = 256
config.EPOCHS = 600
renalprog.dataset¶
Data loading, processing, and splitting utilities.
Key Functions:
- download_data(): Download TCGA data from Xena
- process_downloaded_data(): Process raw TCGA files
- load_data(): Load CSV files
- create_train_test_split(): Stratified train/test split
Example:
from renalprog import dataset
# Download data
rnaseq, clinical, pheno = dataset.download_data(
    destination='data/raw',
    cancer_type='KIRC'
)
# Create split
X_train, X_test, y_train, y_test = dataset.create_train_test_split(
    rnaseq_path='data/processed/rnaseq.csv',
    clinical_path='data/processed/clinical.csv',
    test_size=0.2,
    seed=2023
)
See: Dataset Reference
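A stratified split keeps the proportion of each class (here, presumably tumour stage) roughly equal in the training and test sets. The sketch below illustrates the idea with the standard library only. It is not the implementation behind create_train_test_split(); stratified_split and its rounding rule are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable], test_size: float = 0.2, seed: int = 2023
) -> Tuple[List[int], List[int]]:
    """Return (train_idx, test_idx) so that each label is split in the same ratio.

    Hypothetical illustration; renalprog's own split may differ.
    """
    by_label: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    train_idx: List[int] = []
    test_idx: List[int] = []
    for label in sorted(by_label, key=str):  # fixed label order keeps the split reproducible
        members = by_label[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)
```

The returned index lists can be used to select rows from both the expression matrix and the clinical labels, so the two stay aligned.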
renalprog.features¶
Preprocessing and feature engineering.
Key Functions:
- preprocess_rnaseq(): Filter and normalize gene expression
- filter_low_expression(): Remove lowly expressed genes
- detect_outliers(): Mahalanobis distance outlier detection
- normalize(): Various normalization methods
Example:
from renalprog import features
# Preprocess RNA-seq data
processed, info = features.preprocess_rnaseq(
    data=rnaseq,
    filter_expression=True,
    detect_outliers=True,
    mean_threshold=0.5,
    var_threshold=0.5,
    alpha=0.05
)
See: Features Reference
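The mean_threshold and var_threshold arguments suggest that genes are dropped when their mean or variance across samples is too low. This reference does not say whether the thresholds are absolute values or quantiles, so the sketch below assumes absolute values and uses the population variance. keep_expressed_genes is a hypothetical name and does not reproduce filter_low_expression().

```python
from statistics import fmean, pvariance
from typing import Dict, List


def keep_expressed_genes(
    expression: Dict[str, List[float]],
    mean_threshold: float = 0.5,
    var_threshold: float = 0.5,
) -> Dict[str, List[float]]:
    """Keep genes whose mean and variance across samples both reach the thresholds.

    Assumption: thresholds are absolute values, not quantiles.
    """
    return {
        gene: values
        for gene, values in expression.items()
        if fmean(values) >= mean_threshold and pvariance(values) >= var_threshold
    }
```

A gene with constant expression is removed regardless of its level, because its variance is zero and it carries no information for separating samples.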
renalprog.modeling¶
Machine learning models and training.
Submodules:
- models: VAE architectures
- train: Training functions
- predict: Inference and generation
- trajectories: Synthetic trajectory generation
- classification: XGBoost stage classification
- enrichment: GSEA integration
Example:
from renalprog.modeling import VAE, train_vae, generate_trajectories
# Train VAE
vae, history = train_vae(
    X_train=X_train,
    X_test=X_test,
    config=vae_config
)
# Generate trajectories
trajectories = generate_trajectories(
    vae=vae,
    start_samples=early_samples,
    end_samples=late_samples,
    n_trajectories=100
)
See:
- Models Reference
- Training Reference
- Trajectories Reference
- Classification Reference
- Enrichment Reference
renalprog.plots¶
Visualization functions for all analysis steps.
Key Functions:
- plot_latent_space(): UMAP/t-SNE visualization
- plot_training_history(): Loss curves
- plot_reconstruction(): Original vs reconstructed
- plot_trajectories(): Trajectory visualization
- plot_classification_metrics(): ROC, confusion matrix
- plot_enrichment_heatmap(): Pathway enrichment
Example:
from renalprog import plots
# Visualize latent space
plots.plot_latent_space(
    latent_data,
    color_by='stage',
    method='umap',
    save_path='figures/latent.png',
    figsize=(10, 8),
    title='VAE Latent Space'
)
See: Plots Reference
renalprog.utils¶
Utility functions and helpers.
Key Functions:
- configure_logging(): Set up logging
- set_seed(): Set random seeds for reproducibility
- Timer: Context manager for timing code
- check_file_exists(): File validation
Example:
from renalprog.utils import configure_logging, set_seed, Timer
# Configure logging
configure_logging(level='INFO')
# Set random seed
set_seed(2023)
# Time code execution
with Timer("Training"):
    vae.fit(X_train, epochs=100)
See: Utils Reference
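For readers who want to see what a timing context manager and a seeding helper typically look like, here is a minimal sketch. SimpleTimer and seed_everything are hypothetical stand-ins written for this example; they are not the package's Timer and set_seed(), whose behaviour may differ.

```python
import logging
import random
import time
from typing import Optional

logger = logging.getLogger(__name__)


class SimpleTimer:
    """Context manager that records and logs wall-clock time for a block."""

    def __init__(self, label: str = "block") -> None:
        self.label = label
        self.elapsed: Optional[float] = None

    def __enter__(self) -> "SimpleTimer":
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.elapsed = time.perf_counter() - self._start
        logger.info("%s took %.3f s", self.label, self.elapsed)
        return False  # never swallow exceptions raised inside the block


def seed_everything(seed: int) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional in this sketch
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch is optional in this sketch
```

Seeding alone does not guarantee bit-identical GPU results, since some CUDA kernels are non-deterministic.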
Advanced Topics¶
Custom VAE Architectures¶
from renalprog.modeling.models import BaseVAE
import torch.nn as nn

class CustomVAE(BaseVAE):
    def __init__(self, input_dim, latent_dim):
        super().__init__(input_dim, latent_dim)
        # Custom encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(1024, 512),
            nn.ReLU(),
        )
        self.fc_mu = nn.Linear(512, latent_dim)
        self.fc_logvar = nn.Linear(512, latent_dim)
        # Custom decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 1024),
            nn.ReLU(),
            nn.Linear(1024, input_dim)
        )
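The fc_mu and fc_logvar heads point to the usual Gaussian VAE parameterization, in which the encoder outputs a mean and a log-variance per latent dimension. Assuming BaseVAE follows that convention (this reference does not confirm it), sampling and the training objective would take the standard form below. The squared-error reconstruction term and the weight \beta are assumptions; \beta = 1 gives the plain VAE.

```latex
% Reparameterization: sample z from the encoder outputs (mu, log sigma^2)
z = \mu + \sigma \odot \epsilon,
\qquad \sigma = \exp\!\left(\tfrac{1}{2}\log\sigma^{2}\right),
\qquad \epsilon \sim \mathcal{N}(0, I)

% Loss: reconstruction error plus KL divergence to the standard normal prior
\mathcal{L}(x) = \lVert x - \hat{x} \rVert^{2}
  + \beta \left( -\tfrac{1}{2} \sum_{j=1}^{d}
      \left( 1 + \log\sigma_{j}^{2} - \mu_{j}^{2} - \sigma_{j}^{2} \right) \right)
```

Here d is latent_dim, \hat{x} is the decoder output for z, and the bracketed sum is the closed-form KL divergence between \mathcal{N}(\mu, \operatorname{diag}(\sigma^{2})) and \mathcal{N}(0, I).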
Custom Preprocessing¶
import numpy as np
from renalprog.features import preprocess_rnaseq

def custom_preprocess(data, **kwargs):
    """Custom preprocessing pipeline."""
    # Step 1: Log transform
    data_log = np.log2(data + 1)
    # Step 2: Quantile normalization
    # (quantile_normalize is not defined in this snippet; supply your own implementation)
    data_norm = quantile_normalize(data_log)
    # Step 3: Standard preprocessing
    data_processed, info = preprocess_rnaseq(
        data_norm,
        **kwargs
    )
    return data_processed, info
Batch Processing¶
from renalprog.modeling import VAE, generate_trajectories
import glob

# Process multiple experiments
# (early and late are the start and end sample sets, assumed to be defined beforehand)
for experiment_dir in glob.glob('data/interim/experiment_*'):
    vae = VAE.load(f'{experiment_dir}/vae_model.pt')
    trajectories = generate_trajectories(
        vae=vae,
        start_samples=early,
        end_samples=late,
        n_trajectories=100
    )
    trajectories.to_csv(f'{experiment_dir}/trajectories.csv')
Type Hints¶
The package uses type hints throughout:
from typing import Any, Tuple, Optional, Dict, List
import pandas as pd
import numpy as np
import torch

def preprocess_rnaseq(
    data: pd.DataFrame,
    filter_expression: bool = True,
    mean_threshold: float = 0.5,
    var_threshold: float = 0.5,
    detect_outliers: bool = True,
    alpha: float = 0.05,
    seed: Optional[int] = None
) -> Tuple[pd.DataFrame, Dict[str, Any]]:
    """Preprocess RNA-seq data."""
    ...
Error Handling¶
Common exceptions:
import pandas as pd
from renalprog.dataset import load_data

try:
    data = load_data('nonexistent_file.csv')
except FileNotFoundError as e:
    print(f"File not found: {e}")
except pd.errors.EmptyDataError:
    print("File is empty")
except Exception as e:
    print(f"Unexpected error: {e}")
Performance¶
GPU Acceleration¶
import torch
from renalprog.modeling import VAE

# Use GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
vae = VAE(
    input_dim=5000,
    latent_dim=256,
    device=device
)
Parallel Processing¶
# enrichment.py lives under renalprog/modeling/ in the package structure above
from renalprog.modeling.enrichment import run_gsea_parallel
# Use multiple cores
run_gsea_parallel(
    deseq_dir='data/processed/deseq',
    n_threads=8
)
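This reference does not describe how run_gsea_parallel distributes its work. As a rough illustration of the general pattern, the sketch below fans one task per input file out to a thread pool, which is usually enough when each task mostly waits on an external program. run_in_parallel and run_one are hypothetical names introduced for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable


def run_in_parallel(
    paths: Iterable[Path], run_one: Callable[[Path], str], n_threads: int = 8
) -> Dict[Path, str]:
    """Apply run_one to every path using up to n_threads worker threads.

    Hypothetical pattern only; not the implementation of run_gsea_parallel.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # pool.map preserves input order, so results line up with paths.
        results = list(pool.map(run_one, paths))
    return dict(zip(paths, results))
```

For CPU-bound Python work, a process pool (concurrent.futures.ProcessPoolExecutor) would be the more appropriate choice.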
Testing¶
Run the test suite:
# All tests
pytest
# Specific module
pytest tests/test_dataset.py
# With coverage
pytest --cov=renalprog --cov-report=html
Contributing¶
See Contributing Guidelines for development information.
Version History¶
- v0.1.0 (2024-12): Initial release
  - VAE training pipeline
  - Trajectory generation
  - XGBoost classification
  - GSEA integration
License¶
Apache License 2.0. See LICENSE.