Stage Classification Pipeline

This document describes the cancer stage classification pipeline in renalprog.

Overview

The classification pipeline trains and evaluates XGBoost models to classify cancer samples as early-stage or late-stage based on gene expression data. This is a critical component for validating synthetic trajectories generated by the VAE model.

Purpose

Stage classification serves two main purposes in the renalprog pipeline:

  1. Validation: Verify that the VAE has learned meaningful representations that separate cancer stages
  2. Trajectory Assessment: Evaluate synthetic trajectories by classifying intermediate timepoints to track progression

Pipeline Steps

1. Data Preparation

from renalprog.modeling.classification import prepare_classification_data

# Prepare training and test sets
X_train, X_test, y_train, y_test = prepare_classification_data(
    rnaseq_path="data/interim/preprocessed_KIRC/rnaseq_preprocessed.csv",
    clinical_path="data/interim/preprocessed_KIRC/clinical.csv",
    test_size=0.2,
    seed=2023
)
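
For context, the binary labels come from clinical stage. The grouping below is a hypothetical illustration (AJCC stages I-II as early, III-IV as late); the column name and mapping used internally by prepare_classification_data may differ:

import pandas as pd

# Hypothetical sketch of early/late binarization from the clinical table.
clinical = pd.read_csv("data/interim/preprocessed_KIRC/clinical.csv")
early_stages = {"Stage I", "Stage II"}  # assumed stage values
clinical["late_stage"] = (
    ~clinical["ajcc_pathologic_stage"].isin(early_stages)  # assumed column name
).astype(int)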

2. Model Training

from renalprog.modeling.classification import train_xgboost_classifier

# Train XGBoost classifier
model, metrics = train_xgboost_classifier(
    X_train, y_train,
    X_test, y_test,
    output_dir="models/classification"
)

print(f"Test Accuracy: {metrics['test_accuracy']:.3f}")
print(f"Test F1-Score: {metrics['test_f1']:.3f}")

3. Feature Importance Analysis

from renalprog.modeling.classification import get_feature_importance

# Get top important genes, using the expression matrix's column labels
gene_names = list(X_train.columns)
importance_df = get_feature_importance(model, feature_names=gene_names)
top_genes = importance_df.head(50)

# Save for downstream analysis
top_genes.to_csv("data/external/important_genes_shap.csv", index=False)

4. Model Evaluation

The classifier is evaluated on:

  - Accuracy: Overall fraction of correct predictions
  - Precision: Fraction of predicted positives that are correct
  - Recall: Fraction of actual positives that are recovered
  - F1-Score: Harmonic mean of precision and recall
  - ROC-AUC: Area under the ROC curve
  - Confusion Matrix: Counts of true/false positives and negatives
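
These numbers can also be recomputed directly from the trained model's test-set predictions. A minimal sketch with scikit-learn, assuming model, X_test, and y_test from the steps above:

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix,
)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the late-stage class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-Score :", f1_score(y_test, y_pred))
print("ROC-AUC  :", roc_auc_score(y_test, y_prob))
print(confusion_matrix(y_test, y_pred))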

5. Trajectory Classification

Apply the trained classifier to synthetic trajectories:

from renalprog.modeling.classification import (
    classify_trajectories,
    calculate_progression_rate,  # assumed to be exported from the same module
)

# Classify each timepoint in trajectories
predictions = classify_trajectories(
    model=model,
    trajectories=synthetic_trajectories,
    threshold=0.5
)

# Analyze progression patterns
progression_rate = calculate_progression_rate(predictions)
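
The definition of calculate_progression_rate is not shown here; one plausible reading, sketched as a hypothetical stand-in, is the fraction of trajectory timepoints classified as late-stage:

import numpy as np

def calculate_progression_rate(predictions: np.ndarray) -> float:
    """Hypothetical helper: fraction of timepoints predicted late-stage (label 1)."""
    return float(np.mean(predictions == 1))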

Model Architecture

XGBoost Configuration

Default hyperparameters:

{
    'n_estimators': 100,
    'max_depth': 6,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'random_state': 2023
}
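
These defaults map one-to-one onto the XGBClassifier constructor; a minimal sketch of instantiating the model with them:

from xgboost import XGBClassifier

# Build a classifier with the default hyperparameters listed above.
model = XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    eval_metric='logloss',
    random_state=2023,
)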

Feature Selection

Two modes of operation:

  1. All Genes: Use complete gene expression profile (~20,000 genes)
  2. Important Genes: Use SHAP-selected important features (~500-1000 genes)

Using important genes typically:

  - Improves interpretability
  - Reduces overfitting
  - Maintains comparable performance
  - Speeds up inference
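
To run in the second mode, subset the expression matrix to the saved gene list. A sketch, assuming the first column of important_genes_shap.csv holds gene identifiers matching the columns of X_train:

import pandas as pd

important = pd.read_csv("data/external/important_genes_shap.csv")
genes = important.iloc[:, 0].tolist()  # assumed: first column holds gene IDs
X_train_sel = X_train[genes]
X_test_sel = X_test[genes]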

Output Files

The classification pipeline generates:

models/classification/
├── xgboost_model.json          # Trained XGBoost model
├── feature_importance.csv      # SHAP feature importance scores
├── classification_report.txt   # Detailed metrics
├── confusion_matrix.png        # Confusion matrix plot
├── roc_curve.png               # ROC curve plot
└── training_history.csv        # Training metrics per boosting round
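
The saved model can later be reloaded without retraining, using XGBoost's native JSON format:

from xgboost import XGBClassifier

model = XGBClassifier()
model.load_model("models/classification/xgboost_model.json")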

Usage Example

Complete Classification Workflow

from renalprog.modeling.classification import (
    prepare_classification_data,
    train_xgboost_classifier,
    get_feature_importance,
    classify_trajectories
)
from renalprog.config import get_dated_dir, MODELS_DIR

# 1. Prepare data
X_train, X_test, y_train, y_test = prepare_classification_data(
    rnaseq_path="data/interim/preprocessed_KIRC/rnaseq_preprocessed.csv",
    clinical_path="data/interim/preprocessed_KIRC/clinical.csv",
    test_size=0.2,
    seed=2023
)

# 2. Train model
output_dir = get_dated_dir(MODELS_DIR, "classification_kirc")
model, metrics = train_xgboost_classifier(
    X_train, y_train,
    X_test, y_test,
    output_dir=output_dir
)

# 3. Get important features
importance_df = get_feature_importance(model, X_train.columns)
importance_df.to_csv(f"{output_dir}/feature_importance.csv", index=False)

# 4. Classify trajectories
trajectory_predictions = classify_trajectories(
    model=model,
    trajectories=synthetic_data,
    threshold=0.5
)

print(f"Model saved to: {output_dir}")
print(f"Test F1-Score: {metrics['test_f1']:.3f}")

Script Usage

# Run classification pipeline
python scripts/pipeline_steps/5_classification.py \
  --input_dir data/interim/20251216_train_test_split \
  --output_dir models/20251216_classification_kirc \
  --use_important_genes \
  --n_important 1000

Advanced Topics

Hyperparameter Tuning

For optimal performance, tune hyperparameters using cross-validation:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.1, 0.3],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9]
}

grid_search = GridSearchCV(
    XGBClassifier(),
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
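
This grid spans 243 combinations (1,215 fits at cv=5). When that is too slow, a randomized search over the same space is a common alternative; a sketch:

from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

random_search = RandomizedSearchCV(
    XGBClassifier(),
    param_distributions=param_grid,  # reuse the grid above as a sampling space
    n_iter=30,                       # evaluate 30 of the 243 combinations
    cv=5,
    scoring='f1',
    n_jobs=-1,
    random_state=2023,
)
random_search.fit(X_train, y_train)
best_model = random_search.best_estimator_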

Feature Selection Strategies

  1. SHAP-based: Use SHAP values to identify important features
  2. Mutual Information: Information gain between features and target
  3. Recursive Feature Elimination: Iteratively remove least important features
  4. Variance Threshold: Remove low-variance features
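
To make the first strategy concrete, here is a minimal sketch of SHAP-based ranking with the shap package, assuming a trained XGBoost model and the training matrix from above:

import numpy as np
import pandas as pd
import shap

# TreeExplainer is exact and fast for tree ensembles such as XGBoost.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)

# Rank genes by mean absolute SHAP value across samples.
ranking = pd.Series(
    np.abs(shap_values).mean(axis=0), index=X_train.columns
).sort_values(ascending=False)
top_genes = ranking.head(1000)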

Handling Imbalanced Data

If the stage distribution is imbalanced, weight the positive class accordingly:

from xgboost import XGBClassifier

# scale_pos_weight: ratio of negative (early-stage) to positive (late-stage) samples
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

model = XGBClassifier(
    scale_pos_weight=scale_pos_weight,
    # ... other parameters
)

References

  1. Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).

  2. Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (pp. 4765-4774).

  3. TCGA Research Network. (2013). Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature, 499(7456), 43-49.