Assign2: Non-negative Matrix Factorization with Regularization
Due Thursday, January 25th, 11:59pm
SubmitFor this assignment you will create a non-negative matrix factorization (NMF) algorithm that uses a fused lasso approach for smoothing the latent values. To assess what extent the latent vectors represent biologically meaningful structures, they will be evaluated for their correlation with gene locations.
Data
All of the data can be found here. These bigwig files are from the Segway project (paper). The data files consist of ChIP-seq, DNase-seq, and FAIRE-seq data. The size of the whole genome files are quite large, so if space is a limiting factor you can do all of your testing with the chromosome 10 subsets of the files (*chr10.bw) found here, but your code should be able to function on the files containing the whole genome.The annotation file can be found in the same directory (more specifically it is here). This file delineates where genes are in the genome.
Training
The goal is to write a python script that takes as input the directory containing the bigwig files, the chromosome name, a start position, an end position, and an output filename. The script will run perform NMF using alternating least squares with a fused-lasso regularization to smooth the latent factors. The matrix being factorized has a row for each genome location and a column for each bigwig file (corresponding to *-seq hits). The output latent factors should correlate with externally annotated gene locations.The python code for the proximity operator is provided here and can be significantly sped up with the numba library. Alternatively, for even more of a speed up, you may use cffi with the provided C code (you will need to embed the C code as a string inside your python script).
For example, this
./assign2.py -b . -c chr10 -s 1000000 -e 1500000 -o chromosome_10_loadings.npy -k 10will produce an output file named chromosome_10_loadings.npy containing all of the latent factors (shape: 500000 x 10) computed from the bigwig files in the current directory.
Here is a scaffold framework for your script.
import numpy as np import argparse import pyBigWig from sklearn.utils.extmath import randomized_svd from glob import glob parser = argparse.ArgumentParser(description='Compute NMF to find latent variables with high correlation to transcription start sites') parser.add_argument('-b','--bw_dir',help='Directory containing bigWig file(s)',required=True) parser.add_argument('-c','--chromosome_number',help='Which chromosome to get the values for',required=True) parser.add_argument('-s','--start_pos',type=int,help='Position to start reading from') parser.add_argument('-e','--end_pos',type=int,help='Position to stop reading at') parser.add_argument('-k',type=int,default=10,help='Number of latent vectors') parser.add_argument('-o','--output_file',help='Output file of latent factors matrix.',required=True) args = parser.parse_args() files = glob(f'{args.bw_dir}/*.bw') #construct matrix of values from bigWig files for idx, fname in enumerate(files): # IMPLEMENT -- use pyBigWig to access the .bw files # use args.chromosome_number to access the correct chromosome # use args.start_pos and args.end_pos for the start and end position of the chromosome #deal with NaN, apply any other transformations #setup proximity operator using the provided code def init_H(Y,k): # initialize H # can be a random initialization or using the randomized_svd from sklearn return H def NMF_FL(Y, k, num_iter=50, l2penalty=1, fl_lambda=1): H = init_H(Y,k) #this is the diagonal offset #if l2penalty is small all this does is make the matrix invertible D=np.eye(k) * l2penalty for n in range(num_iter): # Update W # $W \leftarrow Y H^T (H H^T + D)^{-1}$ # Set negative elements of W to 0 # apply fused lasso # Update H # Set negative elements of H to 0 #early stopping? return W, H # Change BIGWIG_DATA to the name of your value matrix # num_iter, l2penalty, and fl_lambda are all hyperparameters that should be tuned to maximize correlation with genes W, H = NMF_FL(BIGWIG_DATA, args.k, num_iter=50, l2penalty=1, fl_lambda=1) np.save(args.output_file,W,allow_pickle=True)
Evaluation
It will be run using python 3.10.12 and pyBigWig 0.3.22. It will be evaluated using the following three commandlines:./assign2.py -b data -c chr10 -s 1000000 -e 1500000 -k 15 -o chromosome_10_loadings.npy # evaluated against Gene ./assign2.py -b data -c chr10 -s 1000000 -e 11000000 -k 10 -o chromosome_10_loadings_long.npy # evaluated against Gene ./assign2.py -b data -c chrX -s 1000000 -e 11000000 -k 5 -o chromosome_X_loadings.npy # evaluated against GeneYour latent values will be evaluated using their maximum correlation with the annotations and by how long it takes to run your script. To combine everything into a single number, we will use the following equation (averaged across the three evaluations). $$ score = (R-R_{baseline})\sqrt{\frac{time_{baseline}}{time}}$$
Note: Your script may not take longer than 10 minutes per a dataset.