Data Requirements¶
Understanding the input data formats and requirements for renalprog.
Note
Although the package has been used with bulk RNA-seq data and cancer stages as labels, the VAE-based pipeline is agnostic to both the data type and the labels, so you can apply it to other kinds of tabular data. For supervised tasks, you can use a CVAE instead of the VAE; see the API reference for details.
Overview¶
renalprog is designed to work with gene expression data and clinical annotations. The primary use case is TCGA (The Cancer Genome Atlas) data, but the package can work with any properly formatted RNA-seq or microarray data.
Input Data Formats¶
1. Gene Expression Data¶
Format: CSV or TSV file with genes as columns and samples as rows (or vice versa).
Example (rnaseq.csv):
,Gene1,Gene2,Gene3,...,GeneN
Sample1,12.5,8.3,0.1,...,5.7
Sample2,13.1,7.9,0.0,...,6.2
Sample3,11.8,9.2,0.3,...,5.1
...
Warning
The default datasets in this package are log2-transformed, RSEM-normalized counts. Before training, the pipeline automatically rescales the data to the [0, 1] range with scikit-learn's MinMaxScaler. Do not train the models on raw counts.
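To make the rescaling step concrete, here is a minimal sketch of per-gene min-max scaling written with NumPy and pandas. It mirrors the per-feature formula that scikit-learn's MinMaxScaler applies with its default settings. The function name `minmax_scale_columns` and the toy values are illustrative assumptions, not part of renalprog.

```python
import pandas as pd


def minmax_scale_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale every column (gene) to the [0, 1] range independently."""
    col_min = df.min(axis=0)
    col_range = df.max(axis=0) - col_min
    # A constant column has zero range; dividing by 1 instead maps it to 0.
    col_range = col_range.replace(0, 1)
    return (df - col_min) / col_range


# Toy log2-scale values (made up), samples as rows and genes as columns
expr = pd.DataFrame(
    {
        "Gene1": [12.5, 13.1, 11.8],
        "Gene2": [8.3, 7.9, 9.2],
        "Gene3": [2.0, 2.0, 2.0],  # constant gene
    },
    index=["Sample1", "Sample2", "Sample3"],
)
scaled = minmax_scale_columns(expr)
print(scaled.round(3))
```

Because the scaling is computed per gene, the result depends on which samples are present when the minimum and maximum are taken.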
2. Clinical Data¶
Format: CSV or TSV file with samples as rows and clinical variables as columns.
Example (clinical.csv):
sample_id,ajcc_pathologic_tumor_stage,age,gender,survival_days
Sample1,Stage I,65,M,1825
Sample2,Stage IIIA,72,F,730
Sample3,Stage II,58,M,2190
...
Required Columns:
- sample_id: Must match gene expression sample IDs
- ajcc_pathologic_tumor_stage: Cancer stage (or similar staging variable)
  - Examples: "Stage I", "Stage II", "Stage IIIA", "Stage IV"
You can use a different column name for staging, but you must pass that name to the loading functions. By default, dataset.load_clinical_data() looks for ajcc_pathologic_tumor_stage; this can be changed with the stage_column parameter. See the API reference for details.
Optional Columns:
- age: Patient age at diagnosis
- gender or sex: M/F or Male/Female
These columns may be used for synthetic trajectory generation. See the tutorial on synthetic trajectory generation for details.
The pipeline automatically groups stages:
- Early: Stage I, II
- Late: Stage III, IV
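The grouping above can be sketched in plain Python. This is an illustration of the rule as described, assuming that substages such as "Stage IIIA" follow their parent stage; it is not renalprog's own implementation, and the name `group_stage` is hypothetical.

```python
import re
from typing import Optional

# Longest numerals first so that "III" is not matched as "I" or "II"
_STAGE_PATTERN = re.compile(r"^\s*Stage\s+(IV|III|II|I)[A-C]?\s*$", re.IGNORECASE)
_GROUPS = {"I": "early", "II": "early", "III": "late", "IV": "late"}


def group_stage(label) -> Optional[str]:
    """Return 'early', 'late', or None for labels that cannot be parsed."""
    if not isinstance(label, str):
        return None  # e.g. NaN from a missing clinical entry
    match = _STAGE_PATTERN.match(label)
    if match is None:
        return None
    return _GROUPS[match.group(1).upper()]


for label in ["Stage I", "Stage II", "Stage IIIA", "Stage IV", "Unknown"]:
    print(label, "->", group_stage(label))
```

Returning None for unparseable labels keeps samples without a usable stage easy to find and drop before training.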
TCGA Data¶
Downloading TCGA Data¶
The package includes utilities to download TCGA data from UCSC Xena Browser:
from renalprog import dataset
# Download KIRC data
rnaseq, clinical, pheno = dataset.download_data(
    destination='data/raw',
    cancer_type='KIRC',
    remove_gz=True
)
TCGA Data Structure¶
TCGA data includes:
- Gene Expression (20,531 genes initially)
  - Platform: Illumina HiSeq
  - Normalization: log2(RSEM+1)
  - File: EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena
- Clinical Data
  - Curated clinical annotations, including survival data.
  - File: Survival_SupplementalTable_S1_20171025_xena_sp
- Phenotype (sample metadata)
  - Sample types (primary tumor, normal, metastatic)
  - File: TCGA_phenotype_denseDataOnlyDownload.tsv
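Sample types can also be read from the sample ID itself: in the TCGA barcode convention, the first two digits of the fourth dash-separated field encode the sample type (for example 01 for primary solid tumor, 06 for metastatic, 11 for solid tissue normal). The sketch below assumes barcode-style sample IDs; the source does not say whether renalprog uses the barcode or a column of the phenotype file, and the function name and example barcodes are made up for illustration.

```python
# Subset of the TCGA sample-type codes
SAMPLE_TYPE_CODES = {
    "01": "Primary Solid Tumor",
    "06": "Metastatic",
    "11": "Solid Tissue Normal",
}


def sample_type_from_barcode(barcode: str) -> str:
    """Map the sample-type field of a TCGA barcode to a readable name."""
    fields = barcode.split("-")
    if len(fields) < 4 or len(fields[3]) < 2:
        return "Unknown"
    code = fields[3][:2]  # ignore a trailing vial letter such as "A"
    return SAMPLE_TYPE_CODES.get(code, "Unknown")


# Made-up barcodes for illustration only
barcodes = ["TCGA-XX-0001-01", "TCGA-XX-0002-11", "TCGA-XX-0003-06"]
primary = [b for b in barcodes if sample_type_from_barcode(b) == "Primary Solid Tumor"]
print(primary)
```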
Custom (Non-TCGA) Data¶
Preparing Your Own Data¶
If you have your own gene expression data:
Step 1: Format Gene Expression¶
import pandas as pd
# Load your data
# Assume you have a samples × genes matrix
your_data = pd.read_csv('your_rnaseq_data.csv', index_col=0)
# Ensure proper format
assert your_data.shape[0] < your_data.shape[1], "Transpose if needed"
# Should have more genes than samples
# Check for missing values
assert not your_data.isnull().any().any(), "Remove missing values"
# Save in standard format
your_data.to_csv('data/raw/your_data_formatted.csv')
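If the missing-value check fails, one option is to drop every gene that has at least one missing entry. The sketch below shows this with pandas on a toy matrix. Dropping genes rather than samples (or imputing values) is a design choice that depends on your data, and `drop_incomplete_genes` is a hypothetical helper, not a renalprog function.

```python
import numpy as np
import pandas as pd


def drop_incomplete_genes(expr: pd.DataFrame) -> pd.DataFrame:
    """Drop every gene (column) that has at least one missing value."""
    return expr.dropna(axis=1, how="any")


# Toy samples x genes matrix with one missing entry
expr = pd.DataFrame(
    {"Gene1": [12.5, 13.1], "Gene2": [8.3, np.nan], "Gene3": [0.1, 0.0]},
    index=["Sample1", "Sample2"],
)

# Report which genes are affected before removing them
missing_per_gene = expr.isnull().sum()
print(missing_per_gene[missing_per_gene > 0])

clean = drop_incomplete_genes(expr)
print(list(clean.columns))
```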
Step 2: Format Clinical Data¶
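# Create the clinical file.
# your_stages, your_races, and your_gender are placeholders for your own
# per-sample lists, given in the same order as your_data.index.
clinical = pd.DataFrame({
    'sample_id': your_data.index,
    'ajcc_pathologic_tumor_stage': your_stages,  # e.g., ["Stage I", "Stage III", ...]
    'race': your_races,
    'gender': your_gender
})
# Map to early/late if needed.
# Labels missing from this map (e.g., substages such as 'Stage IIIA') become NaN,
# so add an entry for every label that occurs in your data.
stage_map = {
    'Stage I': 'early',
    'Stage II': 'early',
    'Stage III': 'late',
    'Stage IV': 'late'
}
clinical['stage_binary'] = clinical['ajcc_pathologic_tumor_stage'].map(stage_map)
clinical.to_csv('data/raw/your_clinical_formatted.csv', index=False)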
Step 3: Validate¶
from renalprog import dataset
# Try loading
rnaseq = dataset.load_data('data/raw/your_data_formatted.csv')
clinical = dataset.load_data('data/raw/your_clinical_formatted.csv')
# Check alignment
assert set(rnaseq.index) == set(clinical['sample_id']), "Sample IDs must match!"
print(f"Samples: {rnaseq.shape[0]}")
print(f"Genes: {rnaseq.shape[1]}")
print(f"Stages: {clinical['ajcc_pathologic_tumor_stage'].value_counts()}")
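When the alignment assertion fails, it helps to see which sample IDs are missing on each side. The following sketch uses only pandas and toy tables; `align_samples` is a hypothetical helper, not a renalprog function. It reports the mismatches and returns both tables restricted to the shared samples, in the same order.

```python
import pandas as pd


def align_samples(rnaseq: pd.DataFrame, clinical: pd.DataFrame, id_column: str = "sample_id"):
    """Report mismatched sample IDs and return both tables restricted to shared samples."""
    expr_ids = set(rnaseq.index)
    clin_ids = set(clinical[id_column])
    only_expr = sorted(expr_ids - clin_ids)  # in expression data only
    only_clin = sorted(clin_ids - expr_ids)  # in clinical data only
    shared = [s for s in rnaseq.index if s in clin_ids]  # keep expression order
    clinical_aligned = clinical.set_index(id_column).loc[shared].reset_index()
    return rnaseq.loc[shared], clinical_aligned, only_expr, only_clin


# Toy tables with one unmatched sample on each side
rnaseq_toy = pd.DataFrame(
    {"Gene1": [12.5, 13.1, 11.8]}, index=["Sample1", "Sample2", "Sample3"]
)
clinical_toy = pd.DataFrame(
    {
        "sample_id": ["Sample3", "Sample1", "Sample4"],
        "ajcc_pathologic_tumor_stage": ["Stage II", "Stage I", "Stage IV"],
    }
)

expr_aligned, clin_aligned, only_expr, only_clin = align_samples(rnaseq_toy, clinical_toy)
print("Only in expression:", only_expr)
print("Only in clinical:", only_clin)
print(clin_aligned)
```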
Data Preprocessing¶
Automatic Preprocessing¶
The pipeline automatically:
- Filters low-expression genes
  - Removes genes with low mean expression
  - Removes genes with low variance
  - Keeps genes expressed in ≥20% of samples
- Detects outliers
  - Uses Mahalanobis distance
  - Removes samples >3 SD from mean
  - Configurable significance level
- Normalizes (optional)
  - Log transformation
  - Z-score normalization
  - Quantile normalization
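As an illustration of the gene-filtering step, the sketch below applies the three criteria with pandas. Only the 20% expressed-fraction threshold comes from the list above. The mean and variance cut-offs, the choice of "value > 0" as the definition of expressed, and the function name `filter_genes` are assumptions made for this example and may differ from what the pipeline does.

```python
import pandas as pd


def filter_genes(expr, min_mean=1.0, min_var=0.1, min_frac_expressed=0.2):
    """Keep genes that pass the mean, variance, and expressed-fraction thresholds."""
    frac_expressed = (expr > 0).mean(axis=0)  # share of samples with a nonzero value
    keep = (
        (expr.mean(axis=0) >= min_mean)
        & (expr.var(axis=0) >= min_var)
        & (frac_expressed >= min_frac_expressed)
    )
    return expr.loc[:, keep]


# Toy samples x genes matrix (made-up values)
expr = pd.DataFrame(
    {
        "GeneA": [5.0, 6.0, 7.0, 5.5, 6.5, 6.0],    # passes all three filters
        "GeneB": [0.0, 0.0, 0.0, 0.0, 0.0, 12.0],   # expressed in 1 of 6 samples
        "GeneC": [4.0, 4.0, 4.1, 4.0, 4.0, 4.1],    # almost no variance
        "GeneD": [0.1, 0.2, 0.0, 0.3, 0.1, 0.2],    # low mean expression
    }
)
filtered = filter_genes(expr)
print(list(filtered.columns))
```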
Example Datasets¶
KIRC (Kidney Renal Clear Cell Carcinoma)¶
The main dataset used in the publication:
- Samples: 533 (530 with stage)
- Genes: 20,531 (8,516 after preprocessing for feature filtering)
- Source: TCGA via UCSC Xena
Other TCGA Cancers¶
The pipeline works with other TCGA cancers:
# Download different cancer type
rnaseq, clinical, pheno = dataset.download_data(
    destination='data/raw',
    cancer_type='BRCA'  # or LUAD, COAD, etc.
)
For example: BRCA, LUAD, LUSC, COAD, STAD, LIHC, PRAD, etc.
"Data looks weird"¶
import numpy as np

# Check if data needs transformation
print("Min:", rnaseq.min().min())
print("Max:", rnaseq.max().max())
# If values are large (>100), the data is likely not log-transformed
if rnaseq.max().max() > 100:
    # Apply a log2(x + 1) transform, matching the normalization of the default datasets
    rnaseq = np.log2(rnaseq + 1)