Assign3: Unsupervised Clustering vs Supervised Labeling
Due Friday, February 2nd, 11:59pm
For this assignment you will perform unsupervised clustering and supervised labeling of single cell data. For these tasks you will use the UMAP package (note that you want to install the umap-learn package, not umap, which is something entirely different), along with the FAISS and sklearn packages. FAISS will be used for K-Means clustering, and we will improve the quality of the clustering by first reducing the dimensionality of the data with UMAP. Note that clustering in a reduced dimension does not always achieve better results, but we are conveniently evaluating your code on data sets where it does help (for the right choice of reduced dimension size). For the supervised labeling task, you are encouraged to use any of the classifiers provided by sklearn to predict the cell_ontology_class. (Hint: you should try several different types of classifiers before doing hyperparameter searches.)
Data
All of the data can be found here. This is the data used in this paper. These are anndata files, which you will recall are HDF5 files formatted for storing annotated dense matrices. We recommend using scanpy to read these in. Of particular use is the filter_genes method, which lets you intelligently downsample the number of columns (genes) in the data.
Notice: You should not remove any rows (cells) from the data, as you are expected to generate a cluster id for each observation. Also notice that for the supervised labeling task, the features (genes) must be the same across the test and train datasets.
Training
The goal is to write a python script that takes as input an anndata file (the data to be clustered/labeled), optionally an integer K (the number of clusters for unsupervised clustering), and optionally another anndata file (the training data for supervised labeling). Specifically, if your script is given an anndata file and a K, then it performs unsupervised clustering and assigns [0,K) cluster ids to each observation (single cell) in the provided anndata file and saves the output to an output file. Otherwise, if your script is given two anndata files (one for training, one for testing), then it performs supervised labeling, fitting a model to predict the cell_ontology_class from the training data and then using this model to make predictions for the test file, saving these predictions to an output file.
After reading in the data using scanpy and performing any feature selection/normalization steps you choose, the data will be reduced in dimension using UMAP.
Despite claims to the contrary, UMAP isn't that stable with respect to the chosen random seed. Our solution sets the random_state to 42 (because). Please explore other, more rational, hyperparameters before tweaking the random seed. Next, you will use the UMAP projection to either perform KMeans clustering (if you were provided a K as an argument) or supervised labeling (if you were provided training data as an argument).
For the supervised labeling task you are allowed to combine the test and train datasets to perform UMAP. Anndata has a useful concat function that may come in handy here. The feature (gene) names of the test and train data need to be the same prior to concatenation. The different datasets use different conventions for the gene names. The following should (mostly) unify the representations:
data.var_names = [v.split('_')[-1] for v in data.var_names]
data.var_names_make_unique()
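To see what the renaming above does, and how the shared feature set might then be found, here is a small plain-Python sketch. The gene names are invented for illustration and do not come from the actual files; `strip_prefix` and `shared_genes` are hypothetical helpers.

```python
def strip_prefix(names):
    """Keep only the text after the last underscore in each name."""
    return [name.split('_')[-1] for name in names]


def shared_genes(train_names, test_names):
    """Genes present in both lists, in the order they appear in train_names."""
    test_set = set(test_names)
    return [gene for gene in train_names if gene in test_set]


# Invented names: one file prefixes an identifier, the other does not.
train_names = strip_prefix(["ID0001_GeneA", "ID0002_GeneB", "ID0003_GeneC"])
test_names = strip_prefix(["GeneC", "GeneA", "GeneD"])

common = shared_genes(train_names, test_names)
```

With AnnData objects, the same list can be used to subset both datasets by column (for example `data[:, common]`) so that train and test share identical features in the same order.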
For example, this
./assign3.py -d mouse_spatial_brain_section1.h5ad -k 4 -o clusters.npy
will produce an output file named clusters.npy containing a one-dimensional output array of cluster identifiers. You must keep the same sample ordering as the input. Note that the output format is the same for both the unsupervised and supervised case.
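The sketch below shows one way to guarantee a flat, one-dimensional output in both cases. The helper name `save_labels` and the example values are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np


def save_labels(path, labels):
    """Flatten labels to one dimension, save them, and return the flat array."""
    flat = np.asarray(labels).ravel()
    np.save(path, flat, allow_pickle=True)
    return flat


tmpdir = tempfile.mkdtemp()

# Unsupervised case: integer cluster ids, here shaped (n, 1) as a
# nearest-centroid search might return them.
cluster_path = os.path.join(tmpdir, "clusters.npy")
save_labels(cluster_path, np.array([[0], [2], [1], [2]]))
clusters = np.load(cluster_path, allow_pickle=True)

# Supervised case: predicted class names (made-up strings).
label_path = os.path.join(tmpdir, "labels.npy")
save_labels(label_path, ["neuron", "astrocyte", "neuron"])
predicted = np.load(label_path, allow_pickle=True)
```

Either file loads back as a one-dimensional array with one entry per cell, in the order the cells were given.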
Here is a scaffold framework for your script.
#!/usr/bin/env python3
import numpy as np
import umap
import faiss
import scanpy
import argparse
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)  #ignore warning about files we are using

parser = argparse.ArgumentParser(description='Unsupervised or Supervised Learning using UMAP, FAISS, and sklearn')
parser.add_argument('-d','--data',help='unlabeled anndata input file',required=True)
parser.add_argument('-k',type=int,help='Number of clusters to identify',required=False)
parser.add_argument('-t','--train_data',help='labeled anndata file for training',required=False)
parser.add_argument('-o','--output_file',help='Output file of cluster assignments (npy).',required=True)
args = parser.parse_args()

#read and preprocess input file(s)

#compute umap embedding (you determine a good dimension, probably not 2 or k)

#either kmeans cluster or sklearn

#extract cluster ids into I, a flat one-dimensional array

np.save(args.output_file,I,allow_pickle=True)
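One possible way to decide which branch to run from the parsed arguments is sketched below. The helper names `build_parser` and `choose_mode` are hypothetical, and raising an error when neither option is given is an assumption about reasonable behavior rather than a stated requirement.

```python
import argparse


def build_parser():
    """Same options as the scaffold, wrapped in a function for reuse."""
    parser = argparse.ArgumentParser(
        description='Unsupervised or Supervised Learning using UMAP, FAISS, and sklearn')
    parser.add_argument('-d', '--data', help='unlabeled anndata input file', required=True)
    parser.add_argument('-k', type=int, help='Number of clusters to identify', required=False)
    parser.add_argument('-t', '--train_data', help='labeled anndata file for training', required=False)
    parser.add_argument('-o', '--output_file', help='Output file of cluster assignments (npy).', required=True)
    return parser


def choose_mode(args):
    """Return 'cluster' if K was given, else 'label' if training data was given."""
    if args.k is not None:
        return 'cluster'
    if args.train_data is not None:
        return 'label'
    raise ValueError('provide either -k or -t/--train_data')
```

For instance, `choose_mode(build_parser().parse_args(['-d', 'in.h5ad', '-k', '4', '-o', 'out.npy']))` selects the clustering branch.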
Evaluation
Your code will be run using python 3.10.12, scanpy 1.9.6, umap 0.5.5, and faiss 1.7.3. It will be evaluated using the following four command lines:
./assign3.py -d mouse_spatial_brain_section1_modified.h5ad -k 4 -o clusters.npy # unsupervised clustering
./assign3.py -d mouse_cortex_methods_comparison_log1p_cpm_modified.h5ad -k 8 -o clusters.npy # unsupervised clustering
./assign3.py -t mouse_spatial_brain_section0.h5ad -d mouse_spatial_brain_section1_modified.h5ad -o clusters.npy # supervised labeling
./assign3.py -t mouse_spatial_brain_section0.h5ad -d mouse_cortex_methods_comparison_log1p_cpm_modified.h5ad -o clusters.npy # supervised labeling
Your output will be evaluated using the adjusted mutual information (AMI) score with respect to the cell ontology annotations. Your overall score is the sum of the improvement in AMI compared to the baseline for the first three commands. The final command is not included in the scoring, since the point is to illustrate how supervised learning can fail to transfer between input domains while unsupervised learning can still find structure in the data. The running time of your script will not be part of your overall score, but it must complete within 3 minutes on each data set.
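To check your own output locally, something like the following sketch may help. It uses sklearn's `adjusted_mutual_info_score` on made-up labels; whether the grader calls this exact function with default settings is an assumption.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

# Made-up ground-truth annotations for six cells.
truth = np.array(["neuron", "neuron", "astrocyte", "astrocyte", "microglia", "microglia"])

# A clustering that matches the annotations exactly, up to renaming of ids.
perfect_clusters = np.array([2, 2, 0, 0, 1, 1])

# A clustering that puts every cell in the same cluster.
single_cluster = np.zeros(len(truth), dtype=int)

ami_perfect = adjusted_mutual_info_score(truth, perfect_clusters)
ami_single = adjusted_mutual_info_score(truth, single_cluster)
```

AMI only compares how the cells are grouped, so the cluster ids themselves do not need to correspond to particular class names.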