Assign8: Dreams of Kinase Inhibition

Checkpoint Wednesday, April 17, 11:59pm

Code due Wednesday, April 24, 11:59pm

Writeup due Friday, April 26

Submit

For this assignment you will retrospectively participate in the IDG-DREAM Drug-Kinase Binding Prediction Challenge. You may use whatever machine learning models and techniques you think will work best. The top performing solutions from the original competition adopted a wide variety of techniques. You can (and should) read about them in this paper. Descriptions provided by the participants can be found here (login required).

Data

Your task is to predict the binding affinity of a kinase inhibitor (expressed as the negative log10 of the Kd/Ki/IC50) given a SMILES string and UniProt identifier. Mappings between UniProt and ChEMBL targets can be found here or here.
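To make the prediction target concrete: it is the negative log10 of the affinity expressed in molar units, so a 10 nM Kd corresponds to a pKd of 8. A small helper, assuming input measurements are given in nM, makes the conversion explicit:

```python
import math

def p_affinity(value_nM):
    '''Convert an affinity in nM (Kd/Ki/IC50) to its
    negative log10 molar value (pKd/pKi/pIC50).'''
    return -math.log10(value_nM * 1e-9)

print(p_affinity(10.0))  # a 10 nM inhibitor -> pKd of 8.0
```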

You may use any publicly available dataset to train your model(s) with the exception that you may not train directly on the supplementary data from the above paper. This will be your validation set (you may use its performance to optimize your hyperparameters). ChEMBL, PubChem, and the Drug Target Commons are good sources of data. The chembl_webresource_client provides a convenient interface for accessing ChEMBL, but it is generally faster to do a bulk download and process all data locally. Note that just getting your data into a usable form will take an obnoxious amount of time, so plan accordingly. To make your life slightly less obnoxious, we provide chembl_activities.txt.gz, which contains all (SMILES, ChEMBL target, pChEMBL) tuples from an older version of ChEMBL.
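As a sketch of the local bulk-processing route, the snippet below reads a tiny in-memory stand-in for chembl_activities.txt.gz (the real file is read the same way, since pandas decompresses .gz transparently; the smiles/target/pchembl column names are those assumed by the baseline code below) and collapses replicate measurements to one value per compound-target pair:

```python
import io
import pandas as pd

# Synthetic stand-in for chembl_activities.txt.gz; pass the real filename
# to read_csv instead and pandas will decompress the .gz transparently.
sample = io.StringIO(
    'smiles target pchembl\n'
    'CCO CHEMBL203 6.2\n'
    'CCO CHEMBL203 6.8\n'
    'CCN CHEMBL279 7.5\n'
)
df = pd.read_csv(sample, sep=r'\s+')

# Collapse replicate measurements to a median per (compound, target) pair.
agg = df.groupby(['smiles', 'target'], as_index=False)['pchembl'].median()
print(agg)
```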

All UniProt ids present in the various validation/test sets are provided with their reference sequence in test_kinase_seqs.txt and a superset of kinases can be found in kinase_seqs.txt.

Mappings from ChEMBL compound identifiers to SMILES strings can be found in chembl_33_chemreps.txt.gz
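That mapping can be loaded into a plain dictionary; the tab-separated column layout sketched here (chembl_id, canonical_smiles, standard_inchi, standard_inchi_key) follows ChEMBL's chemreps releases, illustrated on a synthetic one-row sample:

```python
import io
import pandas as pd

# Synthetic one-row stand-in; the real chembl_33_chemreps.txt.gz is read the
# same way (tab-separated, pandas decompresses the .gz transparently).
sample = io.StringIO(
    'chembl_id\tcanonical_smiles\tstandard_inchi\tstandard_inchi_key\n'
    'CHEMBL25\tCC(=O)Oc1ccccc1C(=O)O\tInChI=1S/...\tBSYNRYMUTXBXSQ-UHFFFAOYSA-N\n'
)
reps = pd.read_csv(sample, sep='\t')
id2smiles = dict(zip(reps.chembl_id, reps.canonical_smiles))
print(id2smiles['CHEMBL25'])
```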

Rounds 1 and 2 of the DREAM challenge can be found in test1_labels.txt and test2_labels.txt. Do not use these files for training. They are your validation sets and are for evaluation only.
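A sketch of how a validation file can be used for evaluation: compute the Spearman rank correlation (the challenge's primary metric) between your predictions and the measured labels, shown here on synthetic values:

```python
import numpy as np
from scipy.stats import spearmanr

labels = np.array([5.0, 6.2, 7.1, 8.3, 6.9])  # measured pKd (synthetic)
preds  = np.array([5.5, 6.5, 7.4, 8.0, 6.0])  # model predictions (synthetic)

# Spearman correlation depends only on the rank ordering of the predictions,
# not on their absolute scale.
rho, pval = spearmanr(preds, labels)
print(f'Spearman rho = {rho:.3f}')  # -> 0.900
```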

Training

Train some sort of model or models on some kind of data. You may not use models that have been pretrained on kinase activity data, but you may use other pretrained models (e.g., for protein or compound featurization). Here is an example of a baseline model that "trains" by recording the mean value for every kinase:
#!/usr/bin/env python3

import pandas as pd
import argparse
import cloudpickle # cloudpickle provides the most robust saving of objects

class Model:
    '''A simple null model that predicts the mean of the target's values'''
    def __init__(self, mapfile, affinities):
        #associate chembl target names with uniprot names
        c2u = dict()
        for line in open(mapfile):
            u,c = line.strip().split()
            c2u[c] = u

        targets = set(c2u.keys()) #chembl targets
        df = pd.read_csv(affinities,delim_whitespace=True)
        #select only the pchembl column so non-numeric columns (e.g. smiles) don't break mean()
        means = df[df.target.isin(targets)].groupby("target")[["pchembl"]].mean()

        self.mean = means.mean().item() #overall average of the per-target averages
        #save indexed by uniprot
        self.u2val = dict()
        for r,row in means.iterrows():
            self.u2val[c2u[row.name]] = row.pchembl    
    
    def predict(self, smile, uniprot):
        '''Predict binding affinity of smile compound to uniprot kinase'''
        if uniprot in self.u2val:
            return self.u2val[uniprot]
        else:
            return self.mean

parser = argparse.ArgumentParser(description='Mean baseline predictions.')
parser.add_argument('--map',type=str, help='map from uniprot to chembl target of interest',required=True)
parser.add_argument('--affinities',type=str, help='ChEMBL affinities file',required=True)
parser.add_argument('--out',type=str, default='model.pkl', help="Output model")
args = parser.parse_args()

model = Model(args.map, args.affinities)
cloudpickle.dump(model,open(args.out,'wb'))
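To go beyond the mean baseline you will need some featurization of the compound. One common (non-pretrained) starting point is an RDKit Morgan fingerprint; a minimal sketch:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def featurize(smiles, n_bits=2048):
    '''Encode a SMILES string as a radius-2 Morgan fingerprint bit vector;
    returns None for unparseable SMILES.'''
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    arr = np.zeros((n_bits,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

vec = featurize('CC(=O)Oc1ccccc1C(=O)O')  # aspirin
print(vec.shape, int(vec.sum()))
```

Such vectors can be concatenated with a protein featurization (e.g., sequence-derived embeddings) to form the input to a standard regressor.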

Evaluation

You should submit code that runs an already trained model on the provided input. Example input files: test1.txt, test2.txt, and indep.txt. The expectation is that your code will dynamically download your already trained model. For example:
#!/usr/bin/env python3
import cloudpickle, sys
import fsspec

#fetch and load model from Google Cloud
model = cloudpickle.load(fsspec.open_files("gs://mscbio2066-data/model.pkl","rb")[0].open())

infile = open(sys.argv[1])
out = open(sys.argv[2],'wt')

out.write(infile.readline()) # header
for line in infile:
    smile,uniprot = line.strip().split()
    val = model.predict(smile,uniprot)
    out.write(f'{smile} {uniprot} {val:.4f}\n')
out.close() #flush predictions so the grader sees the full file
Your output is the input file with a column added containing the predicted affinities. Your code must complete its predictions within 20 minutes.

The indep.txt file is an independent test set that contains unpublished data. Some of the ligand-protein pairs exhibited no activity, so in addition to evaluating the Spearman correlation of your predictions, we will also calculate the ROC AUC when your predictor is used as a classifier. Due to the small size of this set, we also report the p-value of the Spearman correlation and the 90% confidence interval of the AUC.
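You can estimate both metrics on held-out data yourself with scipy and sklearn. Note the active/inactive cutoff used for indep.txt is not specified; the pChEMBL >= 6 (roughly 1 uM) threshold below is only an assumption for illustration:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

labels = np.array([4.5, 5.2, 6.4, 7.8, 8.1, 5.0])  # measured pKd (synthetic)
preds  = np.array([4.9, 5.5, 6.0, 7.2, 7.9, 5.8])  # model predictions (synthetic)

rho, pval = spearmanr(preds, labels)
# Binarize measured activity at an ASSUMED threshold of pChEMBL >= 6,
# then score the raw predictions as classifier confidences.
auc = roc_auc_score(labels >= 6.0, preds)
print(f'rho={rho:.3f} p={pval:.3g} AUC={auc:.3f}')
```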

Checkpoint (10 pts)

You must submit code that outperforms the mean baseline by the checkpoint date or you will receive a 10 point penalty.

Writeup (10 pts)

Describe your approach in plain English using 0.5-2 pages of text. You may include as many figures or code listings as you would like (not subject to the page limit). There should be sufficient detail to reproduce your results (including code can satisfy this requirement). Include commentary on what did and did not work and anything that surprised you. Poorer performing solutions may receive additional credit if it is clear that the approach was well-thought out and significant effort was expended trying to get it to work.

Email a PDF of your writeup to dkoes and mchikina (at pitt.edu) before April 27.

Evaluation Environment

Your code will be evaluated on a machine with 12 cores, 64GB of RAM, and 24GB of GPU RAM. It will be easiest to unpickle your model if you use the same versions of the software that are installed on the grader (or you can experiment with pickling modules by value):
Software    Version
Python      3.10.12
sklearn     1.3.2
xgboost     1.7.3
torch       2.0.0+cu117
lightning   2.2.1
rdkit       2022.09.4
openbabel   3.1.1
scipy       1.10.1
numpy       1.23.5
pandas      2.1.4
BioPython   1.80
deepchem    2.7.1
jax         0.4.8
mol2vec     0.1
sgt         2.0.3
Other packages can be installed as needed.
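If your model's class is defined in your own module, cloudpickle can serialize that module's code by value, so the grader can unpickle the model without having your source on its path. A self-contained sketch (a throwaway, dynamically built module stands in for your real one, which you would simply import and register):

```python
import io, sys, types
import cloudpickle

# Throwaway module standing in for your own model code; with real code you
# would `import mymodel` and pass that module to register_pickle_by_value.
mod = types.ModuleType('mymodel_demo')
exec('class Model:\n    def predict(self, smile, uniprot):\n        return 6.0',
     mod.__dict__)
sys.modules['mymodel_demo'] = mod  # register_pickle_by_value requires an imported module

# Serialize the module's code into the pickle itself, so unpickling does not
# require mymodel_demo to be installed on the grading machine.
cloudpickle.register_pickle_by_value(mod)

buf = io.BytesIO()
cloudpickle.dump(mod.Model(), buf)
buf.seek(0)
restored = cloudpickle.load(buf)
print(restored.predict('CCO', 'P00533'))
```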

Grading

To receive A-level credit (75/80), you need to exceed the Spearman correlation of the A solution on the test set and on at least one validation set. Up to 10 bonus points will be awarded to solutions that beat the competition's best performers (Spearman = 0.56 and 0.53 on the two validation sets).
