Assign1: Black Box Machine Learning with sklearn

Due Tuesday, January 16, 11:59pm

Submit

For this assignment you will create a machine learning model to predict the if a gene is mutated in a patient based on their RNA expression levels.

Data

All datasets are from the cBioPortal. We will be using the following datasets, which all have more than 500 patients with RNA-Seq information.

NameAllMutationsRNA-Seq
Breast Cancer (METABRIC, Nature 2012 & Nat Commun 2016) 250925091904
Breast Invasive Carcinoma (TCGA, Firehose Legacy) 11089821100
Colorectal Adenocarcinoma (TCGA, PanCancer Atlas)594534592
Kidney Renal Clear Cell Carcinoma (TCGA, Firehose Legacy) 538528534
Brain Lower Grade Glioma (TCGA, Firehose Legacy) 530513530
Uterine Corpus Endometrial Carcinoma (TCGA, PanCancer Atlas) 529523527
Head and Neck Squamous Cell Carcinoma (TCGA, Firehose Legacy) 530522522
Lung Adenocarcinoma (TCGA, Firehose Legacy) 586230517
Thyroid Carcinoma (TCGA, Firehose Legacy) 516405509
Lung Squamous Cell Carcinoma (TCGA, Firehose Legacy) 511179501

Look at the data. Not all datasets are constructed from the same platform and they have different distributions. Some are log values while others are not. It is up to you to implement a reasonable policy for dealing with missing or erroneous data and normalizing values.

Training

Your task is to write a python script that takes as input expression and mutations data files for training, an expression data file for making predictions with, a list of genes, and an output prefix. Your script will train a binary classification model on the training expression data to predict if the provided genes are mutated in patients with a given expression profile. A gene is considered to be mutated if there is at least one mutation for that gene in the mutation file that is not a synonymous_variant. You will output separate files for each gene named PREFIX_GENE.txt with the prediction probabilties/scores output with the patient barcode on each line.

For example,

 ./assign1.py -e brca_tcga/data_mrna_seq_v2_rsem.txt -m brca_tcga/data_mutations.txt -p brca_tcga/data_mrna_agilent_microarray.txt -g RYR2  AKAP9 DNAH11 TP53 UTRN HERC2 DNAH2 PIK3CA -o tcga_tcga
will produce the files:
tcga_tcga_AKAP9.txt
tcga_tcga_DNAH11.txt
tcga_tcga_DNAH2.txt
tcga_tcga_HERC2.txt
tcga_tcga_PIK3CA.txt
tcga_tcga_RYR2.txt
tcga_tcga_TP53.txt
tcga_tcga_UTRN.txt
which look like this:
$ head tcga_tcga_AKAP9.txt 
TCGA-A1-A0SD-01 0.26866
TCGA-A1-A0SE-01 0.26794
TCGA-A1-A0SH-01 0.27564
TCGA-A1-A0SJ-01 0.26862
TCGA-A1-A0SK-01 0.29348
TCGA-A1-A0SM-01 0.27980
TCGA-A1-A0SO-01 0.29111
TCGA-A1-A0SP-01 0.28748
TCGA-A2-A04N-01 0.26882
TCGA-A2-A04P-01 0.28286
You may use any of the features of sklearn. Our expectation is you will use the full data set to design an appropriate pipeline for processing the data (e.g., feature selection) and training the model (e.g. model selection and parameter optimization). Your code should be general and work on any data that has the same format as the example datasets (e.g., you should not do manual feature selection where you pre-specify the features by name).

Here is a scaffold framework for your script.

import numpy as np
import pandas as pd
import argparse, time
import sklearn
from sklearn import linear_model, ensemble, svm
from sklearn import metrics
from sklearn import feature_selection, preprocessing

parser = argparse.ArgumentParser(description='Predict if genes are mutated based on mRNA expression data.')
parser.add_argument('-e','--train_expr',nargs='+',help='Expression data file(s)',required=True)
parser.add_argument('-m','--train_mut',nargs='+',help='Mutation data file(s)',required=True)
parser.add_argument('-p','--test_expr',help='Expression data of patients to predict',required=True)
parser.add_argument('-g','--gene',help='Hugo symbol of gene(s) to predict if mutated',nargs='+',required=True)
parser.add_argument('-o','--output_prefix',help='Output prefix of predictions. Each gene\'s predictions are output to PREFIX_GENE.txt',required=True)

args = parser.parse_args()

def read_expr(fname):
    #IMPLEMENT - pandas.read_table recommended
    return expr
    
test = read_expr(args.test_expr)
   
# read in expression training data (X)
for fname in args.train_expr:
    expr = read_expr(fname)
    # do univariate feature selection and per-file normalization    
    # accumulate expression data

genes = set(args.gene)
# read in mutation training data (Y)
for fname in args.train_mut:
    # read in genes with mutations

for gene in args.gene:
    # label-dependent feature selection
    #train model
    #predict

    out = open('%s_%s.txt'%(args.output_prefix,gene),'wt')
    for (name,p) in sorted(LIST_OF_NAME_PREDICTION_TUPLES):
        out.write('%s %.5f\n'%(name,p))

Evaluation

Once you have a trained model you are happy with, you should submit your python script below. It will be run using python 3.10.12 and sklearn 1.3.2 on a machine with a 12 core Intel i9-7920X processor and 64GB of RAM. It will be evaluated using the following three commandlines:
./assign1.py -e brca_tcga/data_mrna_seq_v2_rsem.txt -m brca_tcga/data_mutations.txt -p brca_tcga/data_mrna_agilent_microarray.txt -g RYR2  AKAP9 DNAH11 TP53 UTRN HERC2 DNAH2 PIK3CA -o tcga_tcga

./assign1.py -e brca_metabric/data_mrna_illumina_microarray.txt -m brca_metabric/data_mutations.txt -p brca_tcga/data_mrna_seq_v2_rsem.txt  -g AKAP9 DNAH11 TP53 UTRN RYR2 HERC2 DNAH2 PIK3CA -o meta_tcga

./assign1.py -e  \
  coadread_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem.txt \
  hnsc_tcga/data_RNA_Seq_v2_expression_median.txt  \
  kirc_tcga/data_mrna_seq_v2_rsem.txt \
  lgg_tcga/data_mrna_seq_v2_rsem.txt  \
  luad_tcga/data_RNA_Seq_v2_expression_median.txt  \
  lusc_tcga/data_mrna_seq_v2_rsem.txt  \
  thca_tcga/data_RNA_Seq_v2_expression_median.txt \
  ucec_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem.txt  \
  -m  \
  coadread_tcga_pan_can_atlas_2018/data_mutations.txt  \
  hnsc_tcga/data_mutations_extended.txt  \
  kirc_tcga/data_mutations.txt  \
  lgg_tcga/data_mutations.txt  \
  luad_tcga/data_mutations_extended.txt  \
  lusc_tcga/data_mutations.txt \
  thca_tcga/data_mutations_extended.txt \
  ucec_tcga_pan_can_atlas_2018/data_mutations.txt \
  -p brca_tcga/data_RNA_Seq_v2_expression_median.txt \
  -g AKAP9 DNAH11 TP53 UTRN RYR2 HERC2 DNAH2 PIK3CA \
  -o all_tcga 

Your predictions will be evaluated using the area under the ROC curve (AUC) averaged across the genes and by how long it takes to run your script. To combine everything into a single number, we will use the following equation (averaged across the three evaluations). $$ score = (AUC-AUC_{baseline})\sqrt{\frac{time_{baseline}}{time}}$$

Note: Your script may not take longer than 10 minutes per a dataset.

@pitt.edu