Assign4: Parallel Featurization, Gradient Boosted Trees, and Google Cloud
Due Tuesday, February 13th, 11:59pm
SubmitFor this assignment you will infer cell-type specific chromatin activity solely from DNA sequence motifs using XGBoost. We are inspired by this paper. You will use a set of provided sequence motifs to featurize the whole genome and train a gradient boosted tree ensemble to predict ATAC-seq peaks. The featurization processes is compute intensive, so we will be using Google Cloud to evaluate your code on a 64-core virtual machine (specifically a n1-highcpu-64 machine type).
Google Cloud
You will need to setup a Google Cloud account. A link to receive $50 in credits was posted to slack. Additional credits may be available if needed, but Google is really cutting back on their support for education this year, so please be careful with your credit usage. We have already setup an image with the required software and data for the assignment. You can spin up a virtual machine equivalent to what your code will be graded on with the command:gcloud beta compute --project=$PROJECT instances create my-amazing-virtual-machine --zone=us-central1-f --machine-type=n1-highcpu-64 \ --image=scalable2024 --image-project=xgboost-305718 --boot-disk-size=32GB --boot-disk-type=pd-ssdYou will need to request more quota for CPUs (when you try to launch your VM it will tell you what quotas you need). Don't forget to shutdown your VM when you are done with it. Otherwise you will burn through your credits doing nothing.
Data
All of the data can be found here. These files are also already loaded on our provided GCE image in the /data directory. You can fetch all of them with gsutil (this will be especially fast when transferring to a GCE VM):gsutil -m cp gs://mscbio2066-data/* .
| File | Format | Description |
|---|---|---|
| mm10.fa | FASTA | The GRCm38/mm10 reference genome (source) |
| cisBP_mouse.homer | HOMER | 862 transcription factor DNA binding motifs(source) |
| train.B.bed | BED | Genomic subsequences (all of length 250) where ATAC-seq peaks were observed in immune cells. Name (4th column) is average peak value for given cell type (e.g. B) |
| test.B.bed | BED | Genomic subsequences (all of length 250) where ATAC-seq peaks were observed. No label provided. |
Featurization
Each training example is a 250bp subsequence. Your goal is to predict the ATAC-seq (chromatin accessibility) from the sequence.Motifs
You will apply all 862 motifs in the provided file to the genomic subsequences specified. The provided motifs have a threshold embedded in their name. For example, the first motif:>Nobox+M1945_1.02+D Nobox+M1945_1.02+D/NA 4.62841260470134 0 0.0263157894736839 0 0.973684210526316 0.947368421052632 0 0 0.0526315789473681 1 0 0 0 0.0263157894736841 0.0263157894736841 0.0526315789473682 0.894736842105264 0.0526315789473681 0 0.10526315789474 0.842105263157892 0.394736842105262 0 0.578947368421054 0.0263157894736842 0.105263157894739 0.315789473684207 0.473684210526315 0.105263157894739 0.0526315789473681 0.342105263157891 0.15789473684211 0.447368421052631This motif has a recommended threshold of 4.628. That is, a subsequence is considered a "hit" if the application of the PSSM results in a score greater than this value. Motifs can be read in using biopython and the pfm-four-columns format. Biopython can also be used to search for matches. For example:
print(list(m.pssm.search(seq,both=False))) [(61, 2.6058177947998047), (73, 2.1033174991607666)]The result lists all the positions where the score is above a threshold (default zero).
Features
It is up to you how to convert the results of the application of the PSSM to features. At a minimum you should have one feature per every motif. You might have binary features (are there any motif matches above the recommended threshold), counts (the number of such matches), or numerical features (max score or mean score), or some (small) combination. You may only generate features that are a function of the sequence (e.g. you cannot include the start/end position or chromosome information).Performance
The core of the PSSM search is Bio.motifs._pwm.calculate which is fast compiled C code. However, it is wrapped in a lot of Python that does error checking and other redundant calculations (e.g. recalculating the logodds matrix of a motif). To make your code more performant you will want to call Bio.motifs._pwm.calculate directly and avoid unnecessary recomputations.To have any hope of completing in the required time, you will have to parallelize the featurization process across all available cores. You should use multiprocessing.Pool (or multiprocess).
Training
The goal is to write a python script that takes as input a genome in FASTA format, a motif specification, training input in minimal BED format (with the name of each subsequence being the label), and a test BED file without labels. You will output predicted labels to the requested output file as a one-dimensional numpy array.For example, this
./assign4.py -g mm10.fa -t train.B.bed -m cisBP_mouse.homer -p test.B.bed -o out.npywill produce an output file named out.npy containing a one-dimensional output array of predicted labels. You must keep the same ordering as the input.
Here is a scaffold framework for your script.
#!/usr/bin/env python3
import Bio
import Bio.motifs as motifs
from Bio import SeqIO
import numpy as np
import xgboost as xgb
import argparse, sys,time,os
import multiprocessing
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Predict ATAC-seq data from sequence motifs')
parser.add_argument('-g','--genome',help='reference genome FASTA file',required=True)
parser.add_argument('-t','--train',help='training bed file with chromosome, start, end and label',required=True)
parser.add_argument('-m','--motifs',help='file of motifs to use as features',required=True)
parser.add_argument('-p','--predict',help='training bed file for prediction with chromosome, start, end', required=True)
parser.add_argument('-o','--output_file',help='Output predictions',required=True)
args = parser.parse_args()
#featurize input in parallel
#train xgboost model
np.save(args.output_file,predicts)
You will need to determine the best hyper-parameters for your XGBoost model. Note that while your submitted solution must featurize the data starting from the sequences, when optimizing your model you will most certainly want to work on a pre-processed feature matrix. Try to do as much development on your own machine as possible (to avoid using credits), but check that your solution is using all 64 cores before submitting (run top while your script is running).
Evaluation
Your code will be run using the image described above. Since these runs are time consuming we will only evaluate two runs:./assign4.py -g mm10.fa -t train.small.bed -m cisBP_mouse.homer -p test.small.bed -o out.npy ./assign4.py -g mm10.fa -t train.B.bed -m cisBP_mouse.homer -p test.B.bed -o out.npyYour output will be evaluated using the Pearson correlation coefficient to the true labels and the running time. To combine everything into a single number, we will use the following equation (averaged across the two evaluations). $$ score = (R-R_{baseline})\sqrt{\frac{time_{baseline}}{time}}$$ Additionally, to encourage you to get the best possible predictions, solutions with the highest correlations on the larger B dataset will receive a quality bonus that is added to the final grade (max grade remains 100). A correlation above 0.59 on the larger dataset is an automatic +5.
Your code must complete within 30 minutes on each data set.