Features API¶
The features module provides functions for feature engineering and quality control of gene expression data.
Overview¶
This module includes:
- Low expression gene filtering
- Mahalanobis distance outlier detection
- Gene clustering with tsfresh
- Feature extraction for trajectory analysis
Core Functions¶
filter_low_expression¶
Filter genes with low or invariant expression across samples.
filter_low_expression ¶
filter_low_expression(
data: DataFrame,
mean_threshold: float = 0.5,
var_threshold: float = 0.5,
min_sample_fraction: float = 0.2,
) -> pd.DataFrame
Filter out genes with low expression across samples.
Args: data: DataFrame with genes as rows and samples as columns mean_threshold: Minimum expression value to consider gene as expressed var_threshold: Minimum variance value to consider gene as expressed min_sample_fraction: Minimum fraction of non-expressed samples to filter gene
Returns: Filtered DataFrame with only genes meeting expression criteria
Source code in renalprog/features.py
Example Usage:
import pandas as pd
from renalprog.features import filter_low_expression
# Load raw expression data (genes × samples)
rnaseq = pd.read_csv("data/raw/KIRC_rnaseq.tsv", sep="\t", index_col=0)
print(f"Original: {rnaseq.shape[0]} genes")
# Filter low expression genes
filtered = filter_low_expression(
rnaseq,
mean_threshold=0.5, # Minimum mean expression
var_threshold=0.5, # Minimum variance
min_sample_fraction=0.2 # Maximum fraction of zero values
)
print(f"Filtered: {filtered.shape[0]} genes")
Filtering Criteria:
- Zero expression threshold: Remove genes with >20% samples at zero
- Mean expression threshold: Keep genes with mean ≥ 0.5
- Variance threshold: Keep genes with variance ≥ 0.5
detect_outliers_mahalanobis¶
Detect and remove outlier samples using robust Mahalanobis distance.
detect_outliers_mahalanobis ¶
detect_outliers_mahalanobis(
data: DataFrame,
alpha: float = 0.05,
support_fraction: float = None,
transpose: bool = False,
seed: int = 2023,
) -> Tuple[pd.DataFrame, List[str], np.ndarray]
Detect and remove outlier samples using Mahalanobis distance.
Uses Minimum Covariance Determinant (MCD) for robust covariance estimation, then identifies outliers based on Mahalanobis distance and chi-square distribution.
Args: data: DataFrame with samples as columns and genes as rows (will be transposed) alpha: Significance level for chi-square test (default: 0.05) support_fraction: Fraction of samples to use in MCD estimation (default: 0.75) transpose: Whether to transpose data before processing (default: True) seed: Random seed for reproducibility (default: 2023)
Returns: Tuple of: - cleaned_data: DataFrame with outliers removed - outlier_ids: List of outlier sample IDs - mahalanobis_distances: Array of Mahalanobis distances for all samples
Source code in renalprog/features.py
Example Usage:
import pandas as pd
from renalprog.features import detect_outliers_mahalanobis
# Load gene expression data (genes × samples)
rnaseq = pd.read_csv("data/interim/filtered_expression.csv", index_col=0)
# Detect outliers
cleaned_data, outlier_ids, mahal_distances = detect_outliers_mahalanobis(
rnaseq,
alpha=0.05, # Significance level
support_fraction=None, # Auto-determine robust subset
transpose=True, # Transpose to samples × genes
seed=42
)
print(f"Original samples: {rnaseq.shape[1]}")
print(f"Outliers detected: {len(outlier_ids)}")
print(f"Clean samples: {cleaned_data.shape[1]}")
print(f"Outlier IDs: {outlier_ids}")
How It Works:
- Minimum Covariance Determinant (MCD): Computes robust covariance estimate
- Mahalanobis Distance: Calculates distance of each sample from the robust center
- Chi-square Test: Identifies outliers exceeding chi-square threshold at significance level α
Visualization:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import chi2
# Plot Mahalanobis distances
n_features = rnaseq.shape[0]
cutoff = chi2.ppf(1 - 0.05, n_features)
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.hist(mahal_distances, bins=50, edgecolor='black')
plt.axvline(cutoff, color='red', linestyle='--', label=f'Cutoff (α=0.05)')
plt.xlabel('Mahalanobis Distance')
plt.ylabel('Count')
plt.legend()
plt.title('Distribution of Mahalanobis Distances')
plt.subplot(1, 2, 2)
plt.scatter(range(len(mahal_distances)), sorted(mahal_distances))
plt.axhline(cutoff, color='red', linestyle='--')
plt.xlabel('Sample Index (sorted)')
plt.ylabel('Mahalanobis Distance')
plt.title('Sorted Mahalanobis Distances')
plt.tight_layout()
plt.savefig('outlier_detection.png', dpi=300)
Quality Control Workflow¶
Complete QC pipeline for gene expression data:
from pathlib import Path
import pandas as pd
from renalprog.features import filter_low_expression, detect_outliers_mahalanobis
from renalprog.config import PreprocessingConfig
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def qc_pipeline(rnaseq_path: Path, output_dir: Path):
"""Run complete quality control pipeline."""
# Load configuration
config = PreprocessingConfig()
# Load data
logger.info("Loading RNA-seq data...")
rnaseq = pd.read_csv(rnaseq_path, sep="\t", index_col=0)
logger.info(f"Initial shape: {rnaseq.shape}")
# Filter low expression genes
logger.info("Filtering low expression genes...")
rnaseq_filtered = filter_low_expression(
rnaseq,
mean_threshold=config.mean_threshold,
var_threshold=config.var_threshold,
min_sample_fraction=config.min_sample_fraction
)
logger.info(f"After filtering: {rnaseq_filtered.shape}")
# Detect and remove outliers
logger.info("Detecting outliers...")
rnaseq_clean, outliers, distances = detect_outliers_mahalanobis(
rnaseq_filtered,
alpha=config.outlier_alpha,
seed=config.random_state
)
logger.info(f"Outliers removed: {len(outliers)}")
logger.info(f"Final shape: {rnaseq_clean.shape}")
# Save results
output_dir.mkdir(parents=True, exist_ok=True)
rnaseq_clean.to_csv(output_dir / "expression_qc.csv")
# Save QC report
qc_report = {
'initial_genes': rnaseq.shape[0],
'initial_samples': rnaseq.shape[1],
'filtered_genes': rnaseq_filtered.shape[0],
'genes_removed': rnaseq.shape[0] - rnaseq_filtered.shape[0],
'outlier_samples': len(outliers),
'outlier_ids': outliers,
'final_genes': rnaseq_clean.shape[0],
'final_samples': rnaseq_clean.shape[1]
}
pd.DataFrame([qc_report]).to_csv(output_dir / "qc_report.csv", index=False)
return rnaseq_clean, qc_report
# Run pipeline
if __name__ == "__main__":
rnaseq_clean, report = qc_pipeline(
rnaseq_path=Path("data/raw/KIRC_rnaseq.tsv"),
output_dir=Path("data/interim/qc_results")
)
Advanced Usage¶
Custom Filtering Criteria¶
Define custom filtering thresholds for different datasets:
from renalprog.features import filter_low_expression
# Strict filtering for high-quality datasets
strict_filtered = filter_low_expression(
rnaseq,
mean_threshold=1.0,
var_threshold=1.0,
min_sample_fraction=0.1
)
# Lenient filtering for smaller datasets
lenient_filtered = filter_low_expression(
rnaseq,
mean_threshold=0.1,
var_threshold=0.1,
min_sample_fraction=0.3
)
Outlier Detection with Custom Parameters¶
Adjust sensitivity of outlier detection:
from renalprog.features import detect_outliers_mahalanobis
# Conservative (fewer outliers)
conservative, outliers_con, _ = detect_outliers_mahalanobis(
rnaseq, alpha=0.01, support_fraction=0.8
)
# Liberal (more outliers)
liberal, outliers_lib, _ = detect_outliers_mahalanobis(
rnaseq, alpha=0.10, support_fraction=0.6
)
print(f"Conservative: {len(outliers_con)} outliers")
print(f"Liberal: {len(outliers_lib)} outliers")
See Also¶
- Dataset API - Data loading and preparation
- Configuration API - Preprocessing configuration
- Data Requirements Tutorial - Data preparation guide