Stage Classification Pipeline¶
This document describes the cancer stage classification pipeline in renalprog.
Overview¶
The classification pipeline trains and evaluates XGBoost models to classify cancer samples as early-stage or late-stage based on gene expression data. This is a critical component for validating synthetic trajectories generated by the VAE model.
Purpose¶
Stage classification serves two main purposes in the renalprog pipeline:
- Validation: Verify that the VAE has learned meaningful representations that separate cancer stages
- Trajectory Assessment: Evaluate synthetic trajectories by classifying intermediate timepoints to track progression
Pipeline Steps¶
1. Data Preparation¶
from renalprog.modeling.classification import prepare_classification_data
# Prepare training and test sets
X_train, X_test, y_train, y_test = prepare_classification_data(
    rnaseq_path="data/interim/preprocessed_KIRC/rnaseq_preprocessed.csv",
    clinical_path="data/interim/preprocessed_KIRC/clinical.csv",
    test_size=0.2,
    seed=2023
)
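This page does not spell out how stages are binarized or how the split is drawn. As a rough illustration only, the sketch below assumes AJCC-style labels ("Stage I" to "Stage IV"), treats stages I/II as early (0) and III/IV as late (1), and draws a stratified split so that both sets keep similar class proportions. The names `STAGE_TO_LABEL` and `stratified_split` are hypothetical and are not part of renalprog.

```python
import random

# Assumed mapping (not confirmed by the source): stages I/II are "early" (0),
# stages III/IV are "late" (1).
STAGE_TO_LABEL = {"Stage I": 0, "Stage II": 0, "Stage III": 1, "Stage IV": 1}


def stratified_split(sample_ids, labels, test_size=0.2, seed=2023):
    """Split sample IDs into train/test lists, preserving class proportions."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        # Collect and shuffle the samples that belong to this class only.
        members = [s for s, y in zip(sample_ids, labels) if y == cls]
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy cohort: 10 early-stage and 5 late-stage samples.
stages = ["Stage I"] * 6 + ["Stage II"] * 4 + ["Stage III"] * 3 + ["Stage IV"] * 2
ids = [f"sample_{i}" for i in range(len(stages))]
y = [STAGE_TO_LABEL[s] for s in stages]

train_ids, test_ids = stratified_split(ids, y)
print(len(train_ids), len(test_ids))  # 12 3
```

Stratifying matters when one stage group is much smaller than the other, since a purely random split could leave the test set with very few late-stage samples.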
2. Model Training¶
from renalprog.modeling.classification import train_xgboost_classifier
# Train XGBoost classifier
model, metrics = train_xgboost_classifier(
    X_train, y_train,
    X_test, y_test,
    output_dir="models/classification"
)
print(f"Test Accuracy: {metrics['test_accuracy']:.3f}")
print(f"Test F1-Score: {metrics['test_f1']:.3f}")
3. Feature Importance Analysis¶
from renalprog.modeling.classification import get_feature_importance
# Get top important genes
importance_df = get_feature_importance(model, feature_names=X_train.columns)
top_genes = importance_df.head(50)
# Save for downstream analysis
top_genes.to_csv("data/external/important_genes_shap.csv", index=False)
4. Model Evaluation¶
The classifier is evaluated on:
- Accuracy: Overall correct predictions
- Precision: Correct positive predictions
- Recall: Ability to find all positive cases
- F1-Score: Harmonic mean of precision and recall
- ROC-AUC: Area under the ROC curve
- Confusion Matrix: True/false positives and negatives
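To make the relationship between these metrics concrete, here is a small standard-library sketch that derives accuracy, precision, recall, and F1 from the confusion-matrix counts. It assumes late-stage is encoded as the positive class (1); the function name `binary_metrics` and the toy labels are illustrative, not part of renalprog.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and derived metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": accuracy, "precision": precision,
        "recall": recall, "f1": f1,
    }


# Toy example: 4 late-stage (1) and 6 early-stage (0) samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
m = binary_metrics(y_true, y_pred)
print(m)
```

In this toy case accuracy is 0.8 while precision, recall, and F1 are all 0.75, which shows how accuracy can look better than the positive-class metrics when the negative class is larger.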
5. Trajectory Classification¶
Apply the trained classifier to synthetic trajectories:
from renalprog.modeling.classification import classify_trajectories
# Classify each timepoint in trajectories
predictions = classify_trajectories(
    model=model,
    trajectories=synthetic_trajectories,
    threshold=0.5
)
# Analyze progression patterns
# Note: calculate_progression_rate is not imported in this snippet; it must already be in scope
progression_rate = calculate_progression_rate(predictions)
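The behavior of `calculate_progression_rate` is not documented on this page. One plausible way to summarize progression, offered purely as an assumption, is to threshold the per-timepoint late-stage probabilities and report the fraction of trajectories classified as late-stage at each step, plus the first step at which each trajectory crosses the threshold. The helper names and the probability values below are hypothetical.

```python
def late_fraction_per_timepoint(probabilities, threshold=0.5):
    """Return, for each timepoint, the fraction of trajectories predicted late-stage.

    probabilities[i][t] is the predicted late-stage probability of
    trajectory i at timepoint t (all trajectories share the same length).
    """
    n_traj = len(probabilities)
    n_steps = len(probabilities[0])
    return [
        sum(traj[t] >= threshold for traj in probabilities) / n_traj
        for t in range(n_steps)
    ]


def first_late_timepoint(trajectory, threshold=0.5):
    """Index of the first timepoint predicted late-stage, or None if never."""
    for t, p in enumerate(trajectory):
        if p >= threshold:
            return t
    return None


# Made-up probabilities for three trajectories over four timepoints.
probs = [
    [0.10, 0.30, 0.60, 0.90],
    [0.20, 0.40, 0.45, 0.70],
    [0.05, 0.10, 0.20, 0.30],
]
fractions = late_fraction_per_timepoint(probs)
crossings = [first_late_timepoint(p) for p in probs]
print(fractions)
print(crossings)  # [2, 3, None]
```

A fraction that rises steadily along the trajectory is what one would hope to see if the synthetic timepoints really move from early-stage toward late-stage expression profiles.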
Model Architecture¶
XGBoost Configuration¶
Default hyperparameters:
{
    'n_estimators': 100,
    'max_depth': 6,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'random_state': 2023
}
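With the `binary:logistic` objective, XGBoost accumulates tree outputs into a raw margin in log-odds space and maps it through the logistic function to obtain a probability. The simplified sketch below ignores the base-score offset and uses invented leaf outputs; it only illustrates why a probability threshold of 0.5 corresponds to a margin of 0.

```python
import math


def sigmoid(margin):
    """Logistic function: maps a log-odds margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


# Invented leaf outputs from three trees for one sample (illustrative only;
# the real model's base-score offset is omitted for simplicity).
leaf_outputs = [0.4, -0.1, 0.3]

margin = sum(leaf_outputs)                  # raw score in log-odds space
probability = sigmoid(margin)               # probability of the positive class
predicted_label = int(probability >= 0.5)   # same decision as margin >= 0

print(round(probability, 3), predicted_label)
```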
Feature Selection¶
Two modes of operation:
- All Genes: Use complete gene expression profile (~20,000 genes)
- Important Genes: Use SHAP-selected important features (~500-1000 genes)
Using important genes typically:
- Improves interpretability
- Reduces overfitting
- Maintains comparable performance
- Speeds up inference
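Switching to the important-gene mode amounts to ranking genes by an importance score and keeping only the top N columns. The following is a minimal sketch with placeholder gene names and invented scores; `select_top_genes` and `subset_expression` are hypothetical helpers, not renalprog functions.

```python
def select_top_genes(importance, n_top):
    """Return the n_top gene names with the highest importance scores."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:n_top]


def subset_expression(sample, genes):
    """Keep only the selected genes from one sample's expression values."""
    return {g: sample[g] for g in genes}


# Placeholder gene names with invented importance scores.
importance = {
    "GENE_A": 0.30,
    "GENE_B": 0.05,
    "GENE_C": 0.25,
    "GENE_D": 0.01,
    "GENE_E": 0.20,
}
sample = {"GENE_A": 7.1, "GENE_B": 3.2, "GENE_C": 5.5, "GENE_D": 0.4, "GENE_E": 6.0}

top_genes = select_top_genes(importance, n_top=3)
reduced = subset_expression(sample, top_genes)
print(top_genes)  # ['GENE_A', 'GENE_C', 'GENE_E']
print(reduced)
```

The same gene list has to be applied to training data, test data, and synthetic trajectories so that the classifier always sees the same columns in the same order.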
Output Files¶
The classification pipeline generates:
models/classification/
├── xgboost_model.json # Trained XGBoost model
├── feature_importance.csv # SHAP feature importance scores
├── classification_report.txt # Detailed metrics
├── confusion_matrix.png # Visualization
├── roc_curve.png # ROC curve plot
└── training_history.csv # Training metrics per boosting round
Usage Example¶
Complete Classification Workflow¶
from renalprog.modeling.classification import (
    prepare_classification_data,
    train_xgboost_classifier,
    get_feature_importance,
    classify_trajectories,
)
from renalprog.config import get_dated_dir, MODELS_DIR
# 1. Prepare data
X_train, X_test, y_train, y_test = prepare_classification_data(
    rnaseq_path="data/interim/preprocessed_KIRC/rnaseq_preprocessed.csv",
    clinical_path="data/interim/preprocessed_KIRC/clinical.csv",
    test_size=0.2,
    seed=2023
)
# 2. Train model
output_dir = get_dated_dir(MODELS_DIR, "classification_kirc")
model, metrics = train_xgboost_classifier(
    X_train, y_train,
    X_test, y_test,
    output_dir=output_dir
)
# 3. Get important features
importance_df = get_feature_importance(model, X_train.columns)
importance_df.to_csv(f"{output_dir}/feature_importance.csv", index=False)
# 4. Classify trajectories (synthetic_data is not defined in this snippet; load it beforehand)
trajectory_predictions = classify_trajectories(
    model=model,
    trajectories=synthetic_data,
    threshold=0.5
)
print(f"Model saved to: {output_dir}")
print(f"Test F1-Score: {metrics['test_f1']:.3f}")
Script Usage¶
# Run classification pipeline
python scripts/pipeline_steps/5_classification.py \
    --input_dir data/interim/20251216_train_test_split \
    --output_dir models/20251216_classification_kirc \
    --use_important_genes \
    --n_important 1000
Advanced Topics¶
Hyperparameter Tuning¶
To improve on the defaults, tune the hyperparameters with a cross-validated grid search:
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
# Candidate values for each hyperparameter
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.1, 0.3],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9]
}
# 5-fold cross-validation, scored by F1, using all available cores
grid_search = GridSearchCV(
    XGBClassifier(),
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
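The grid above has five parameters with three candidate values each, so an exhaustive search is expensive. The cost can be checked with a few lines of standard-library Python (no model is trained here):

```python
from itertools import product

# Same grid as in the tuning example above.
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.1, 0.3],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9],
}
cv_folds = 5

# Every combination of one value per parameter is a candidate model.
n_combinations = len(list(product(*param_grid.values())))
n_fits = n_combinations * cv_folds

print(n_combinations, n_fits)  # 243 1215
```

With scikit-learn's default `refit=True`, one additional fit on the full training set follows the search. If that budget is too large, `RandomizedSearchCV` from the same `sklearn.model_selection` module samples a fixed number of combinations instead of trying all of them.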
Feature Selection Strategies¶
- SHAP-based: Use SHAP values to identify important features
- Mutual Information: Information gain between features and target
- Recursive Feature Elimination: Iteratively remove least important features
- Variance Threshold: Remove low-variance features
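As an illustration of the simplest of these strategies, the sketch below applies a variance filter using only the standard library. The cutoff value, the gene names, and the expression values are invented, and the helper `low_variance_genes` is hypothetical; scikit-learn's `VarianceThreshold` offers a ready-made version of the same idea.

```python
from statistics import pvariance


def low_variance_genes(expression, threshold):
    """Return gene names whose population variance across samples is at or below threshold."""
    return [g for g, values in expression.items() if pvariance(values) <= threshold]


# Invented expression values for three placeholder genes across four samples.
expression = {
    "GENE_A": [5.0, 5.0, 5.0, 5.0],   # constant, variance 0
    "GENE_B": [1.0, 2.0, 3.0, 4.0],   # variance 1.25
    "GENE_C": [2.0, 2.1, 1.9, 2.0],   # variance about 0.005
}

dropped = low_variance_genes(expression, threshold=0.01)
kept = [g for g in expression if g not in dropped]
print(dropped)  # ['GENE_A', 'GENE_C']
print(kept)     # ['GENE_B']
```

Unlike the SHAP-based and recursive strategies, a variance filter never looks at the stage labels, so it can be applied before the train/test split without leaking label information.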
Handling Imbalanced Data¶
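If the stage distribution is imbalanced, up-weight the positive class with scale_pos_weight, the ratio of negative to positive training samples:
from xgboost import XGBClassifier
# Ratio of negative (label 0) to positive (label 1) training samples
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
model = XGBClassifier(
    scale_pos_weight=scale_pos_weight,
    # ... other parameters
)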
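As a quick worked example of the ratio, assume a training set with 300 early-stage (0) and 100 late-stage (1) samples. These counts are invented for illustration, and the helper name is hypothetical.

```python
def compute_scale_pos_weight(labels):
    """Ratio of negative (0) to positive (1) samples; raises if no positives exist."""
    n_negative = sum(1 for y in labels if y == 0)
    n_positive = sum(1 for y in labels if y == 1)
    if n_positive == 0:
        raise ValueError("No positive samples: the ratio is undefined.")
    return n_negative / n_positive


# Invented class counts: 300 early-stage and 100 late-stage samples.
y_toy = [0] * 300 + [1] * 100
weight = compute_scale_pos_weight(y_toy)
print(weight)  # 3.0
```

A value of 3.0 means each late-stage sample contributes three times as much to the loss as an early-stage sample. If the positive class is the majority, the ratio falls below 1 and down-weights it instead.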
See Also¶
- Trajectory Generation - Generate synthetic cancer trajectories
- Enrichment Analysis - Pathway enrichment on trajectories
- API Reference - Detailed API documentation
- Tutorial - Step-by-step classification tutorial
References¶
- Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).
- Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (pp. 4765-4774).
- TCGA Research Network. (2013). Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature, 499(7456), 43-49.