MSCBIO 2066 Assignment 1

Assign1: Black Box Machine Learning with sklearn

Due Tuesday, January 16, 11:59pm

For this assignment you will create a machine learning model to predict the if a gene is mutated in a patient based on their RNA expression levels.

Data

All datasets are from the cBioPortal. We will be using the following datasets, which all have more than 500 patients with RNA-Seq information.

Name All Mutations RNA-Seq

Breast Cancer (METABRIC, Nature 2012 & Nat Commun 2016) 2509 2509 1904

Breast Invasive Carcinoma (TCGA, Firehose Legacy) 1108 982 1100

Colorectal Adenocarcinoma (TCGA, PanCancer Atlas) 594 534 592

Kidney Renal Clear Cell Carcinoma (TCGA, Firehose Legacy) 538 528 534

Brain Lower Grade Glioma (TCGA, Firehose Legacy) 530 513 530

Uterine Corpus Endometrial Carcinoma (TCGA, PanCancer Atlas) 529 523 527

Head and Neck Squamous Cell Carcinoma (TCGA, Firehose Legacy) 530 522 522

Lung Adenocarcinoma (TCGA, Firehose Legacy) 586 230 517

Thyroid Carcinoma (TCGA, Firehose Legacy) 516 405 509

Lung Squamous Cell Carcinoma (TCGA, Firehose Legacy) 511 179 501

Name	All	Mutations	RNA-Seq
Breast Cancer (METABRIC, Nature 2012 & Nat Commun 2016)	2509	2509	1904
Breast Invasive Carcinoma (TCGA, Firehose Legacy)	1108	982	1100
Colorectal Adenocarcinoma (TCGA, PanCancer Atlas)	594	534	592
Kidney Renal Clear Cell Carcinoma (TCGA, Firehose Legacy)	538	528	534
Brain Lower Grade Glioma (TCGA, Firehose Legacy)	530	513	530
Uterine Corpus Endometrial Carcinoma (TCGA, PanCancer Atlas)	529	523	527
Head and Neck Squamous Cell Carcinoma (TCGA, Firehose Legacy)	530	522	522
Lung Adenocarcinoma (TCGA, Firehose Legacy)	586	230	517
Thyroid Carcinoma (TCGA, Firehose Legacy)	516	405	509
Lung Squamous Cell Carcinoma (TCGA, Firehose Legacy)	511	179	501

Look at the data. Not all datasets are constructed from the same platform and they have different distributions. Some are log values while others are not. It is up to you to implement a reasonable policy for dealing with missing or erroneous data and normalizing values.

Training

Your task is to write a python script that takes as input expression and mutations data files for training, an expression data file for making predictions with, a list of genes, and an output prefix. Your script will train a binary classification model on the training expression data to predict if the provided genes are mutated in patients with a given expression profile. A gene is considered to be mutated if there is at least one mutation for that gene in the mutation file that is not a synonymous_variant. You will output separate files for each gene named PREFIX_GENE.txt with the prediction probabilties/scores output with the patient barcode on each line.

For example,

 ./assign1.py -e brca_tcga/data_mrna_seq_v2_rsem.txt -m brca_tcga/data_mutations.txt -p brca_tcga/data_mrna_agilent_microarray.txt -g RYR2  AKAP9 DNAH11 TP53 UTRN HERC2 DNAH2 PIK3CA -o tcga_tcga

will produce the files:

tcga_tcga_AKAP9.txt
tcga_tcga_DNAH11.txt
tcga_tcga_DNAH2.txt
tcga_tcga_HERC2.txt
tcga_tcga_PIK3CA.txt
tcga_tcga_RYR2.txt
tcga_tcga_TP53.txt
tcga_tcga_UTRN.txt

which look like this:

$ head tcga_tcga_AKAP9.txt 
TCGA-A1-A0SD-01 0.26866
TCGA-A1-A0SE-01 0.26794
TCGA-A1-A0SH-01 0.27564
TCGA-A1-A0SJ-01 0.26862
TCGA-A1-A0SK-01 0.29348
TCGA-A1-A0SM-01 0.27980
TCGA-A1-A0SO-01 0.29111
TCGA-A1-A0SP-01 0.28748
TCGA-A2-A04N-01 0.26882
TCGA-A2-A04P-01 0.28286

You may use any of the features of sklearn. Our expectation is you will use the full data set to design an appropriate pipeline for processing the data (e.g., feature selection) and training the model (e.g. model selection and parameter optimization). Your code should be general and work on any data that has the same format as the example datasets (e.g., you should not do manual feature selection where you pre-specify the features by name).

Here is a scaffold framework for your script.

import numpy as np
import pandas as pd
import argparse, time
import sklearn
from sklearn import linear_model, ensemble, svm
from sklearn import metrics
from sklearn import feature_selection, preprocessing

parser = argparse.ArgumentParser(description='Predict if genes are mutated based on mRNA expression data.')
parser.add_argument('-e','--train_expr',nargs='+',help='Expression data file(s)',required=True)
parser.add_argument('-m','--train_mut',nargs='+',help='Mutation data file(s)',required=True)
parser.add_argument('-p','--test_expr',help='Expression data of patients to predict',required=True)
parser.add_argument('-g','--gene',help='Hugo symbol of gene(s) to predict if mutated',nargs='+',required=True)
parser.add_argument('-o','--output_prefix',help='Output prefix of predictions. Each gene\'s predictions are output to PREFIX_GENE.txt',required=True)

args = parser.parse_args()

def read_expr(fname):
    #IMPLEMENT - pandas.read_table recommended
    return expr
    
test = read_expr(args.test_expr)
   
# read in expression training data (X)
for fname in args.train_expr:
    expr = read_expr(fname)
    # do univariate feature selection and per-file normalization    
    # accumulate expression data

genes = set(args.gene)
# read in mutation training data (Y)
for fname in args.train_mut:
    # read in genes with mutations

for gene in args.gene:
    # label-dependent feature selection
    #train model
    #predict

    out = open('%s_%s.txt'%(args.output_prefix,gene),'wt')
    for (name,p) in sorted(LIST_OF_NAME_PREDICTION_TUPLES):
        out.write('%s %.5f\n'%(name,p))

Evaluation

Once you have a trained model you are happy with, you should submit your python script below. It will be run using python 3.10.12 and sklearn 1.3.2 on a machine with a 12 core Intel i9-7920X processor and 64GB of RAM. It will be evaluated using the following three commandlines:

./assign1.py -e brca_tcga/data_mrna_seq_v2_rsem.txt -m brca_tcga/data_mutations.txt -p brca_tcga/data_mrna_agilent_microarray.txt -g RYR2  AKAP9 DNAH11 TP53 UTRN HERC2 DNAH2 PIK3CA -o tcga_tcga

./assign1.py -e brca_metabric/data_mrna_illumina_microarray.txt -m brca_metabric/data_mutations.txt -p brca_tcga/data_mrna_seq_v2_rsem.txt  -g AKAP9 DNAH11 TP53 UTRN RYR2 HERC2 DNAH2 PIK3CA -o meta_tcga

./assign1.py -e  \
  coadread_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem.txt \
  hnsc_tcga/data_RNA_Seq_v2_expression_median.txt  \
  kirc_tcga/data_mrna_seq_v2_rsem.txt \
  lgg_tcga/data_mrna_seq_v2_rsem.txt  \
  luad_tcga/data_RNA_Seq_v2_expression_median.txt  \
  lusc_tcga/data_mrna_seq_v2_rsem.txt  \
  thca_tcga/data_RNA_Seq_v2_expression_median.txt \
  ucec_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem.txt  \
  -m  \
  coadread_tcga_pan_can_atlas_2018/data_mutations.txt  \
  hnsc_tcga/data_mutations_extended.txt  \
  kirc_tcga/data_mutations.txt  \
  lgg_tcga/data_mutations.txt  \
  luad_tcga/data_mutations_extended.txt  \
  lusc_tcga/data_mutations.txt \
  thca_tcga/data_mutations_extended.txt \
  ucec_tcga_pan_can_atlas_2018/data_mutations.txt \
  -p brca_tcga/data_RNA_Seq_v2_expression_median.txt \
  -g AKAP9 DNAH11 TP53 UTRN RYR2 HERC2 DNAH2 PIK3CA \
  -o all_tcga

Your predictions will be evaluated using the area under the ROC curve (AUC) averaged across the genes and by how long it takes to run your script. To combine everything into a single number, we will use the following equation (averaged across the three evaluations). $$ score = (AUC-AUC_{baseline})\sqrt{\frac{time_{baseline}}{time}}$$

Note: Your script may not take longer than 10 minutes per a dataset.

User ID (private):	@pitt.edu
Avatar Name (public):
Code: