Assign1: Black Box Machine Learning with sklearn
Due Tuesday, January 16, 11:59pm
SubmitFor this assignment you will create a machine learning model to predict the if a gene is mutated in a patient based on their RNA expression levels.
Data
All datasets are from the cBioPortal. We will be using the following datasets, which all have more than 500 patients with RNA-Seq information.
Name | All | Mutations | RNA-Seq |
---|---|---|---|
Breast Cancer (METABRIC, Nature 2012 & Nat Commun 2016) | 2509 | 2509 | 1904 |
Breast Invasive Carcinoma (TCGA, Firehose Legacy) | 1108 | 982 | 1100 |
Colorectal Adenocarcinoma (TCGA, PanCancer Atlas) | 594 | 534 | 592 |
Kidney Renal Clear Cell Carcinoma (TCGA, Firehose Legacy) | 538 | 528 | 534 |
Brain Lower Grade Glioma (TCGA, Firehose Legacy) | 530 | 513 | 530 |
Uterine Corpus Endometrial Carcinoma (TCGA, PanCancer Atlas) | 529 | 523 | 527 |
Head and Neck Squamous Cell Carcinoma (TCGA, Firehose Legacy) | 530 | 522 | 522 |
Lung Adenocarcinoma (TCGA, Firehose Legacy) | 586 | 230 | 517 |
Thyroid Carcinoma (TCGA, Firehose Legacy) | 516 | 405 | 509 |
Lung Squamous Cell Carcinoma (TCGA, Firehose Legacy) | 511 | 179 | 501 |
Look at the data. Not all datasets are constructed from the same platform and they have different distributions. Some are log values while others are not. It is up to you to implement a reasonable policy for dealing with missing or erroneous data and normalizing values.
Training
Your task is to write a python script that takes as input expression and mutations data files for training, an expression data file for making predictions with, a list of genes, and an output prefix. Your script will train a binary classification model on the training expression data to predict if the provided genes are mutated in patients with a given expression profile. A gene is considered to be mutated if there is at least one mutation for that gene in the mutation file that is not a synonymous_variant. You will output separate files for each gene named PREFIX_GENE.txt with the prediction probabilties/scores output with the patient barcode on each line.For example,
./assign1.py -e brca_tcga/data_mrna_seq_v2_rsem.txt -m brca_tcga/data_mutations.txt -p brca_tcga/data_mrna_agilent_microarray.txt -g RYR2 AKAP9 DNAH11 TP53 UTRN HERC2 DNAH2 PIK3CA -o tcga_tcgawill produce the files:
tcga_tcga_AKAP9.txt tcga_tcga_DNAH11.txt tcga_tcga_DNAH2.txt tcga_tcga_HERC2.txt tcga_tcga_PIK3CA.txt tcga_tcga_RYR2.txt tcga_tcga_TP53.txt tcga_tcga_UTRN.txtwhich look like this:
$ head tcga_tcga_AKAP9.txt TCGA-A1-A0SD-01 0.26866 TCGA-A1-A0SE-01 0.26794 TCGA-A1-A0SH-01 0.27564 TCGA-A1-A0SJ-01 0.26862 TCGA-A1-A0SK-01 0.29348 TCGA-A1-A0SM-01 0.27980 TCGA-A1-A0SO-01 0.29111 TCGA-A1-A0SP-01 0.28748 TCGA-A2-A04N-01 0.26882 TCGA-A2-A04P-01 0.28286You may use any of the features of sklearn. Our expectation is you will use the full data set to design an appropriate pipeline for processing the data (e.g., feature selection) and training the model (e.g. model selection and parameter optimization). Your code should be general and work on any data that has the same format as the example datasets (e.g., you should not do manual feature selection where you pre-specify the features by name).
Here is a scaffold framework for your script.
import numpy as np import pandas as pd import argparse, time import sklearn from sklearn import linear_model, ensemble, svm from sklearn import metrics from sklearn import feature_selection, preprocessing parser = argparse.ArgumentParser(description='Predict if genes are mutated based on mRNA expression data.') parser.add_argument('-e','--train_expr',nargs='+',help='Expression data file(s)',required=True) parser.add_argument('-m','--train_mut',nargs='+',help='Mutation data file(s)',required=True) parser.add_argument('-p','--test_expr',help='Expression data of patients to predict',required=True) parser.add_argument('-g','--gene',help='Hugo symbol of gene(s) to predict if mutated',nargs='+',required=True) parser.add_argument('-o','--output_prefix',help='Output prefix of predictions. Each gene\'s predictions are output to PREFIX_GENE.txt',required=True) args = parser.parse_args() def read_expr(fname): #IMPLEMENT - pandas.read_table recommended return expr test = read_expr(args.test_expr) # read in expression training data (X) for fname in args.train_expr: expr = read_expr(fname) # do univariate feature selection and per-file normalization # accumulate expression data genes = set(args.gene) # read in mutation training data (Y) for fname in args.train_mut: # read in genes with mutations for gene in args.gene: # label-dependent feature selection #train model #predict out = open('%s_%s.txt'%(args.output_prefix,gene),'wt') for (name,p) in sorted(LIST_OF_NAME_PREDICTION_TUPLES): out.write('%s %.5f\n'%(name,p))
Evaluation
Once you have a trained model you are happy with, you should submit your python script below. It will be run using python 3.10.12 and sklearn 1.3.2 on a machine with a 12 core Intel i9-7920X processor and 64GB of RAM. It will be evaluated using the following three commandlines:./assign1.py -e brca_tcga/data_mrna_seq_v2_rsem.txt -m brca_tcga/data_mutations.txt -p brca_tcga/data_mrna_agilent_microarray.txt -g RYR2 AKAP9 DNAH11 TP53 UTRN HERC2 DNAH2 PIK3CA -o tcga_tcga ./assign1.py -e brca_metabric/data_mrna_illumina_microarray.txt -m brca_metabric/data_mutations.txt -p brca_tcga/data_mrna_seq_v2_rsem.txt -g AKAP9 DNAH11 TP53 UTRN RYR2 HERC2 DNAH2 PIK3CA -o meta_tcga ./assign1.py -e \ coadread_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem.txt \ hnsc_tcga/data_RNA_Seq_v2_expression_median.txt \ kirc_tcga/data_mrna_seq_v2_rsem.txt \ lgg_tcga/data_mrna_seq_v2_rsem.txt \ luad_tcga/data_RNA_Seq_v2_expression_median.txt \ lusc_tcga/data_mrna_seq_v2_rsem.txt \ thca_tcga/data_RNA_Seq_v2_expression_median.txt \ ucec_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem.txt \ -m \ coadread_tcga_pan_can_atlas_2018/data_mutations.txt \ hnsc_tcga/data_mutations_extended.txt \ kirc_tcga/data_mutations.txt \ lgg_tcga/data_mutations.txt \ luad_tcga/data_mutations_extended.txt \ lusc_tcga/data_mutations.txt \ thca_tcga/data_mutations_extended.txt \ ucec_tcga_pan_can_atlas_2018/data_mutations.txt \ -p brca_tcga/data_RNA_Seq_v2_expression_median.txt \ -g AKAP9 DNAH11 TP53 UTRN RYR2 HERC2 DNAH2 PIK3CA \ -o all_tcgaYour predictions will be evaluated using the area under the ROC curve (AUC) averaged across the genes and by how long it takes to run your script. To combine everything into a single number, we will use the following equation (averaged across the three evaluations). $$ score = (AUC-AUC_{baseline})\sqrt{\frac{time_{baseline}}{time}}$$
Note: Your script may not take longer than 10 minutes per a dataset.