Stage Classification Pipeline

This document describes the cancer stage classification pipeline in renalprog.

Overview

The classification pipeline trains and evaluates XGBoost models to classify cancer samples as early-stage or late-stage based on gene expression data. This is a critical component for validating synthetic trajectories generated by the VAE model.

Purpose

Stage classification serves two main purposes in the renalprog pipeline:

  1. Validation: Verify that the VAE has learned meaningful representations that separate cancer stages
  2. Trajectory Assessment: Evaluate synthetic trajectories by classifying intermediate timepoints to track progression

Pipeline Steps

1. Data Preparation

from renalprog.modeling.classification import prepare_classification_data

# Prepare training and test sets
X_train, X_test, y_train, y_test = prepare_classification_data(
    rnaseq_path="data/interim/preprocessed_KIRC/rnaseq_preprocessed.csv",
    clinical_path="data/interim/preprocessed_KIRC/clinical.csv",
    test_size=0.2,
    seed=2023
)
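
For context, the binary labels come from clinical stage. The grouping below is a hypothetical illustration (AJCC stages I-II as early, III-IV as late); the column name and mapping used internally by prepare_classification_data may differ:

import pandas as pd

# Hypothetical sketch of early/late binarization from the clinical table.
clinical = pd.read_csv("data/interim/preprocessed_KIRC/clinical.csv")
early_stages = {"Stage I", "Stage II"}  # assumed stage values
clinical["late_stage"] = (
    ~clinical["ajcc_pathologic_stage"].isin(early_stages)  # assumed column name
).astype(int)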

2. Model Training

from renalprog.modeling.classification import train_xgboost_classifier

# Train XGBoost classifier
model, metrics = train_xgboost_classifier(
    X_train, y_train,
    X_test, y_test,
    output_dir="models/classification"
)

print(f"Test Accuracy: {metrics['test_accuracy']:.3f}")
print(f"Test F1-Score: {metrics['test_f1']:.3f}")

3. Feature Importance Analysis

from renalprog.modeling.classification import get_feature_importance

# Get top important genes, using the expression matrix's column labels
gene_names = list(X_train.columns)
importance_df = get_feature_importance(model, feature_names=gene_names)
top_genes = importance_df.head(50)

# Save for downstream analysis
top_genes.to_csv("data/external/important_genes_shap.csv", index=False)

4. Model Evaluation

The classifier is evaluated on:

  - Accuracy: Overall fraction of correct predictions
  - Precision: Fraction of predicted positives that are correct
  - Recall: Fraction of actual positives that are recovered
  - F1-Score: Harmonic mean of precision and recall
  - ROC-AUC: Area under the ROC curve
  - Confusion Matrix: Counts of true/false positives and negatives
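
These numbers can also be recomputed directly from the trained model's test-set predictions. A minimal sketch with scikit-learn, assuming model, X_test, and y_test from the steps above:

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix,
)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the late-stage class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-Score :", f1_score(y_test, y_pred))
print("ROC-AUC  :", roc_auc_score(y_test, y_prob))
print(confusion_matrix(y_test, y_pred))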

5. Trajectory Classification

Apply the trained classifier to synthetic trajectories:

from renalprog.modeling.classification import (
    classify_trajectories,
    calculate_progression_rate,  # assumed to be exported from the same module
)

# Classify each timepoint in trajectories
predictions = classify_trajectories(
    model=model,
    trajectories=synthetic_trajectories,
    threshold=0.5
)

# Analyze progression patterns
progression_rate = calculate_progression_rate(predictions)
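
The definition of calculate_progression_rate is not shown here; one plausible reading, sketched as a hypothetical stand-in, is the fraction of trajectory timepoints classified as late-stage:

import numpy as np

def calculate_progression_rate(predictions: np.ndarray) -> float:
    """Hypothetical helper: fraction of timepoints predicted late-stage (label 1)."""
    return float(np.mean(predictions == 1))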

Model Architecture

XGBoost Configuration

Default hyperparameters:

{
    'n_estimators': 100,
    'max_depth': 6,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'random_state': 2023
}
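
These defaults map one-to-one onto the XGBClassifier constructor; a minimal sketch of instantiating the model with them:

from xgboost import XGBClassifier

# Build a classifier with the default hyperparameters listed above.
model = XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    eval_metric='logloss',
    random_state=2023,
)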

Feature Selection

Two modes of operation:

  1. All Genes: Use complete gene expression profile (~20,000 genes)
  2. Important Genes: Use SHAP-selected important features (~500-1000 genes)

Using important genes typically:

  - Improves interpretability
  - Reduces overfitting
  - Maintains comparable performance
  - Speeds up inference
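
To run in the second mode, subset the expression matrix to the saved gene list. A sketch, assuming the first column of important_genes_shap.csv holds gene identifiers matching the columns of X_train:

import pandas as pd

important = pd.read_csv("data/external/important_genes_shap.csv")
genes = important.iloc[:, 0].tolist()  # assumed: first column holds gene IDs
X_train_sel = X_train[genes]
X_test_sel = X_test[genes]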

Output Files

The classification pipeline generates:

models/classification/
├── xgboost_model.json          # Trained XGBoost model
├── feature_importance.csv      # SHAP feature importance scores
├── classification_report.txt   # Detailed metrics
├── confusion_matrix.png        # Confusion matrix plot
├── roc_curve.png               # ROC curve plot
└── training_history.csv        # Training metrics per boosting round
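
The saved model can later be reloaded without retraining, using XGBoost's native JSON format:

from xgboost import XGBClassifier

model = XGBClassifier()
model.load_model("models/classification/xgboost_model.json")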

Usage Example

Complete Classification Workflow

from renalprog.modeling.classification import (
    prepare_classification_data,
    train_xgboost_classifier,
    get_feature_importance,
    classify_trajectories
)
from renalprog.config import get_dated_dir, MODELS_DIR

# 1. Prepare data
X_train, X_test, y_train, y_test = prepare_classification_data(
    rnaseq_path="data/interim/preprocessed_KIRC/rnaseq_preprocessed.csv",
    clinical_path="data/interim/preprocessed_KIRC/clinical.csv",
    test_size=0.2,
    seed=2023
)

# 2. Train model
output_dir = get_dated_dir(MODELS_DIR, "classification_kirc")
model, metrics = train_xgboost_classifier(
    X_train, y_train,
    X_test, y_test,
    output_dir=output_dir
)

# 3. Get important features
importance_df = get_feature_importance(model, X_train.columns)
importance_df.to_csv(f"{output_dir}/feature_importance.csv", index=False)

# 4. Classify trajectories
trajectory_predictions = classify_trajectories(
    model=model,
    trajectories=synthetic_data,
    threshold=0.5
)

print(f"Model saved to: {output_dir}")
print(f"Test F1-Score: {metrics['test_f1']:.3f}")

Script Usage

# Run classification pipeline
python scripts/pipeline_steps/5_classification.py \
  --input_dir data/interim/20251216_train_test_split \
  --output_dir models/20251216_classification_kirc \
  --use_important_genes \
  --n_important 1000

Advanced Topics

Hyperparameter Tuning

For optimal performance, tune hyperparameters using cross-validation:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.1, 0.3],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9]
}

grid_search = GridSearchCV(
    XGBClassifier(),
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
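
This grid spans 243 combinations (1,215 fits at cv=5). When that is too slow, a randomized search over the same space is a common alternative; a sketch:

from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

random_search = RandomizedSearchCV(
    XGBClassifier(),
    param_distributions=param_grid,  # reuse the grid above as a sampling space
    n_iter=30,                       # evaluate 30 of the 243 combinations
    cv=5,
    scoring='f1',
    n_jobs=-1,
    random_state=2023,
)
random_search.fit(X_train, y_train)
best_model = random_search.best_estimator_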

Feature Selection Strategies

  1. SHAP-based: Use SHAP values to identify important features
  2. Mutual Information: Information gain between features and target
  3. Recursive Feature Elimination: Iteratively remove least important features
  4. Variance Threshold: Remove low-variance features
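
To make the first strategy concrete, here is a minimal sketch of SHAP-based ranking with the shap package, assuming a trained XGBoost model and the training matrix from above:

import numpy as np
import pandas as pd
import shap

# TreeExplainer is exact and fast for tree ensembles such as XGBoost.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)

# Rank genes by mean absolute SHAP value across samples.
ranking = pd.Series(
    np.abs(shap_values).mean(axis=0), index=X_train.columns
).sort_values(ascending=False)
top_genes = ranking.head(1000)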

Handling Imbalanced Data

If the stage distribution is imbalanced, weight the positive class accordingly:

from xgboost import XGBClassifier

# scale_pos_weight: ratio of negative (early-stage) to positive (late-stage) samples
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

model = XGBClassifier(
    scale_pos_weight=scale_pos_weight,
    # ... other parameters
)

References

  1. Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).

  2. Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (pp. 4765-4774).

  3. TCGA Research Network. (2013). Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature, 499(7456), 43-49.