Dynamic Enrichment Analysis¶
This document describes the dynamic enrichment analysis pipeline for renalprog.
Overview¶
The enrichment analysis pipeline performs Gene Set Enrichment Analysis (GSEA) on synthetic cancer progression trajectories. This allows us to identify biological pathways that are enriched at different stages of progression.
Pipeline Steps¶
1. pyDESeq2 Differential Expression Analysis¶
For each synthetic trajectory timepoint:
- Load trajectory gene expression data (reverse log-transform from RSEM)
- Load healthy control samples (reverse log-transform from RSEM)
- Run pyDESeq2 differential expression analysis comparing trajectory vs controls
- Extract log2 fold-change and adjusted p-values for each gene
- Rank genes by log2 fold-change
- Save ranked gene list (
.rnkfile) for GSEA
Note: The pipeline uses PyDESeq2 for proper differential expression analysis, not simple fold-change calculations. This ensures statistical rigor and proper handling of count data variance.
2. GSEA Analysis¶
For each ranked gene list:
- Run GSEA using preranked mode
- Test against pathway database (ReactomePathways.gmt)
- Calculate enrichment scores and FDR q-values
- Generate positive and negative enrichment reports
3. Results Combination¶
Combine all GSEA results into a single dataset:
- One row per (patient, timepoint, pathway)
- Includes enrichment score (ES), normalized ES (NES), and FDR q-value
- Missing pathways filled with NaN values
Installation Requirements¶
Python Dependencies¶
The enrichment pipeline requires PyDESeq2 for differential expression analysis:
PyDESeq2 is a Python implementation of the DESeq2 method for differential expression analysis of count data.
Citation:
Muzellec, B., Telenczuk, M., & Cabeli, V. (2022).
PyDESeq2: a python package for bulk RNA-seq differential expression analysis.
bioRxiv, 2022-12.
GSEA CLI Tool¶
- Download GSEA from: https://www.gsea-msigdb.org/gsea/downloads.jsp
- Extract to project root (creates
GSEA_4.3.2/directory) - Ensure
gsea-cli.sh(Unix) orgsea-cli.bat(Windows) is executable
Citation:
Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... & Mesirov, J. P. (2005).
Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.
Proceedings of the National Academy of Sciences, 102(43), 15545-15550.
Pathway Database¶
The ReactomePathways.gmt file is included in data/external/ReactomePathways.gmt.
Citation:
Jassal, B., Matthews, L., Viteri, G., Gong, C., Lorente, P., Fabregat, A., ... & D'Eustachio, P. (2020).
The reactome pathway knowledgebase.
Nucleic acids research, 48(D1), D498-D503.
Technical Details:
The DESeq2 processing involves:
- Reverse log-transformation: Input data is log-transformed RSEM values, which are converted back to RSEM.
- PyDESeq2 analysis: Proper variance modeling and statistical testing.
- Rank file generation: Genes ranked by log fold-change for GSEA preranked mode.
GSEA Parameters¶
Default GSEA parameters in generate_gsea_command(): - collapse: false - nperm: 1000 permutations - set_max: 500 (maximum pathway size) - set_min: 15 (minimum pathway size)
Output Format¶
Directory Structure¶
output_dir/
├── test_to_test/ # Synthetic trajectories
│ ├── early_to_late/ # Transition type
│ │ ├── patient1_to_patient2/ # Patient trajectory
│ │ │ ├── patient1_to_patient_0.rnk # Ranked gene list for timepoint 0
│ │ │ ├── patient1_to_patient_1.rnk
│ │ │ ├── ...
│ │ │ ├── reports/ # GSEA output for all patients in directory
│ │ │ │ ├── patient1_to_patient_0.GseaPreranked.* # GSEA output files
│ │ │ │ ├── gsea_report_for_na_pos_*.tsv # Positive enrichment report
│ │ │ │ └── gsea_report_for_na_neg_*.tsv # Negative enrichment report
│ │ ├──patient3_to_patient4/
│ │ ├── ...
│ └── gsea_commands_*.cmd # GSEA command files
├── full_gsea_reports_kirc.csv # Final combined results
└── heatmap_kirc_significantNES.csv # Significant NES heatmap data
Final Results Format¶
full_gsea_reports_kirc.csv columns:
Patient: Patient identifier (e.g., "TCGA-CZ-5989-01_to_TCGA-B0-5108-01)Idx: Timepoint index (0 to n_samples-1)Transition: Transition type (e.g., "early_to_late")NAME: Pathway name (from ReactomePathways.gmt)ES: Enrichment scoreNES: Normalized enrichment scoreFDR q-val: False discovery rate q-value
Example:
Patient,Idx,Transition,NAME,ES,NES,FDR q-val
TCGA-CZ-5989-01_to_TCGA-B0-5108-01,0,early_to_late,Cell Cycle,0.65,2.13,0.001
TCGA-CZ-5989-01_to_TCGA-B0-5108-01,0,early_to_late,DNA Repair,0.52,1.87,0.012
TCGA-CZ-5989-01_to_TCGA-B0-5108-01,1,early_to_late,Cell Cycle,0.71,2.31,0.000
GSEA Not Found¶
Error: GSEA CLI not found at ./GSEA_4.3.2/gsea-cli.sh
Solution:
- Download GSEA from https://www.gsea-msigdb.org/gsea/downloads.jsp
- Extract to project root
- Or specify custom path with
--gsea_path
GSEA Command Failures¶
Error: GSEA commands fail with non-zero exit code
Solution:
- Check GSEA installation
- Verify pathway file format (GMT)
- Check file permissions
- Review GSEA log files in output directories
Missing Pathways¶
Warning: Some pathways have all NaN values
Explanation: Normal - not all pathways are significant in every sample
Windows-Specific Issues¶
Error: Cannot run gsea-cli.sh on Windows
Solution:
- Install Git Bash or WSL (Windows Subsystem for Linux)
- Use
gsea-cli.batinstead if available - Or run in WSL environment