Assign2: Non-negative Matrix Factorization with Regularization

Due Thursday, January 25th, 11:59pm

Submit

For this assignment you will create a non-negative matrix factorization (NMF) algorithm that uses a fused lasso approach for smoothing the latent values. To assess what extent the latent vectors represent biologically meaningful structures, they will be evaluated for their correlation with gene locations.

Data

All of the data can be found here. These bigwig files are from the Segway project (paper). The data files consist of ChIP-seq, DNase-seq, and FAIRE-seq data. The size of the whole genome files are quite large, so if space is a limiting factor you can do all of your testing with the chromosome 10 subsets of the files (*chr10.bw) found here, but your code should be able to function on the files containing the whole genome.

The annotation file can be found in the same directory (more specifically it is here). This file delineates where genes are in the genome.

Training

The goal is to write a python script that takes as input the directory containing the bigwig files, the chromosome name, a start position, an end position, and an output filename. The script will run perform NMF using alternating least squares with a fused-lasso regularization to smooth the latent factors. The matrix being factorized has a row for each genome location and a column for each bigwig file (corresponding to *-seq hits). The output latent factors should correlate with externally annotated gene locations.

The python code for the proximity operator is provided here and can be significantly sped up with the numba library. Alternatively, for even more of a speed up, you may use cffi with the provided C code (you will need to embed the C code as a string inside your python script).

For example, this

./assign2.py -b . -c chr10 -s 1000000 -e 1500000 -o chromosome_10_loadings.npy -k 10
will produce an output file named chromosome_10_loadings.npy containing all of the latent factors (shape: 500000 x 10) computed from the bigwig files in the current directory.

Here is a scaffold framework for your script.

import numpy as np
import argparse
import pyBigWig
from sklearn.utils.extmath import randomized_svd
from glob import glob

parser = argparse.ArgumentParser(description='Compute NMF to find latent variables with high correlation to transcription start sites')
parser.add_argument('-b','--bw_dir',help='Directory containing bigWig file(s)',required=True)
parser.add_argument('-c','--chromosome_number',help='Which chromosome to get the values for',required=True)
parser.add_argument('-s','--start_pos',type=int,help='Position to start reading from')
parser.add_argument('-e','--end_pos',type=int,help='Position to stop reading at')
parser.add_argument('-k',type=int,default=10,help='Number of latent vectors')
parser.add_argument('-o','--output_file',help='Output file of latent factors matrix.',required=True)

args = parser.parse_args()

files = glob(f'{args.bw_dir}/*.bw')

#construct matrix of values from bigWig files
for idx, fname in enumerate(files):
    # IMPLEMENT -- use pyBigWig to access the .bw files
	# use args.chromosome_number to access the correct chromosome
	# use args.start_pos and args.end_pos for the start and end position of the chromosome

    #deal with NaN, apply any other transformations

#setup proximity operator using the provided code

def init_H(Y,k):
	# initialize H
	# can be a random initialization or using the randomized_svd from sklearn

	return H

def NMF_FL(Y, k, num_iter=50, l2penalty=1, fl_lambda=1):
    H = init_H(Y,k)

    #this is the diagonal offset
    #if l2penalty is small all this does is make the matrix invertible
    D=np.eye(k) * l2penalty
    for n in range(num_iter):
        # Update W
        # $W \leftarrow Y H^T (H H^T + D)^{-1}$

        # Set negative elements of W to 0

        # apply fused lasso

        # Update H

        # Set negative elements of H to 0

        #early stopping?
    return W, H

# Change BIGWIG_DATA to the name of your value matrix
# num_iter, l2penalty, and fl_lambda are all hyperparameters that should be tuned to maximize correlation with genes
W, H = NMF_FL(BIGWIG_DATA, args.k, num_iter=50, l2penalty=1, fl_lambda=1)

np.save(args.output_file,W,allow_pickle=True)

Evaluation

It will be run using python 3.10.12 and pyBigWig 0.3.22. It will be evaluated using the following three commandlines:
./assign2.py -b data -c chr10 -s 1000000 -e 1500000  -k 15 -o chromosome_10_loadings.npy # evaluated against Gene
./assign2.py -b data -c chr10 -s 1000000 -e 11000000 -k 10 -o chromosome_10_loadings_long.npy # evaluated against Gene
./assign2.py -b data -c chrX  -s 1000000 -e 11000000 -k 5 -o chromosome_X_loadings.npy # evaluated against Gene
Your latent values will be evaluated using their maximum correlation with the annotations and by how long it takes to run your script. To combine everything into a single number, we will use the following equation (averaged across the three evaluations). $$ score = (R-R_{baseline})\sqrt{\frac{time_{baseline}}{time}}$$

Note: Your script may not take longer than 10 minutes per a dataset.

@pitt.edu