Assign7: A Smiling Variational Autoencoder

Checkpoint: Monday, April 1, 11:59pm (extended from Friday, March 29, 11:59pm)

Due: Tuesday, April 9, 11:59pm (extended from Monday, April 8)

Submit

For this assignment you will use PyTorch to design and train a Variational Autoencoder (VAE) to generate SMILES strings. This will give you experience training both generative and recurrent models.

Data

The source data comes from PubChem.

We have filtered the SMILES in PubChem to include only those with a reduced alphabet of characters (e.g., no "weird" elements like selenium; molecules are Kekulized, so there are no aromatic characters) and a length of less than 150 characters. Filtering the PubChem SMILES strings with these criteria leaves 62,129,800 valid SMILES strings. We provide these SMILES both in canonical form (here) and (somewhat) randomized (here). The code to randomize SMILES can be found here, in case you wish to further augment this data.
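The filtering criteria above can be sketched as a simple predicate. Note that the `ALLOWED` alphabet below is an assumption for illustration, not the exact character set used to build the dataset:

```python
# Hypothetical sketch of the dataset filtering criteria.
# ALLOWED is an assumed reduced alphabet, not the actual one used.
ALLOWED = set("CNOSPFBrIl()[]=#123456789%+-@H/\\.")

def passes_filter(smiles: str) -> bool:
    # Keep only strings under 150 characters drawn from the reduced alphabet;
    # lowercase aromatic atoms (e.g. 'c') are excluded because the data is Kekulized.
    return len(smiles) < 150 and set(smiles) <= ALLOWED
```

For example, `passes_filter("CCO")` accepts ethanol, while an aromatic string like `"c1ccccc1"` is rejected because the Kekulized alphabet has no lowercase atoms.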

Training

We will be using PyTorch to define and train our VAE. Your model will be evaluated using PyTorch version 2.0.0.

To get you started, we are providing you with a skeleton VAE model with a working Dataset class: train_VAE.py.
You will want to re-write the Dataset class to load preprocessed data (e.g., a saved NumPy array) for faster initialization. You may use whatever architecture you think best for the encoder and decoder (e.g., CNN, GRU, or Transformer; don't forget the positional encoding), but your decoder must take a 1024-dimensional vector sampled around a unit Gaussian and generate a SMILES string with a maximum length of 150.
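A preprocessing step along these lines might encode each SMILES string into a fixed-length integer array once, ahead of training. The vocabulary, special tokens, and padding scheme below are assumptions for illustration, not those of the provided skeleton:

```python
# Hypothetical character-level encoding for a preprocessed dataset;
# the alphabet, special tokens, and padding scheme are assumptions.
MAX_LEN = 150
PAD, SOS, EOS = 0, 1, 2
ALPHABET = "CNOSPFBrIl()[]=#123456789%+-@H/\\."
CHAR2IDX = {c: i + 3 for i, c in enumerate(ALPHABET)}

def encode(smiles: str) -> list:
    """Map a SMILES string to a padded index sequence of length MAX_LEN + 2."""
    ids = [SOS] + [CHAR2IDX[c] for c in smiles] + [EOS]
    return ids + [PAD] * (MAX_LEN + 2 - len(ids))
```

Encoding the whole corpus once and saving the result (e.g. with `numpy.save`) lets the Dataset's `__init__` load a single array instead of re-parsing tens of millions of strings.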

Here are some VAE papers that you can use to model your own VAE after: Tutorial on Variational Autoencoders, A Neural Representation of Sketch Drawings, PixelVAE: A Latent Variable Model For Natural Images, and Generating Sentences from a Continuous Space.

Here are repositories/Pytorch tutorials that may be helpful in constructing your own VAE: Pytorch Seq2Seq Tutorial, Pytorch Sentence-VAE, Pytorch Sketch-RNN.
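Whichever architecture you choose, the VAE objective in these references combines a reconstruction loss with a KL term that pulls the approximate posterior toward the unit Gaussian your decoder will later be sampled from. A minimal, framework-free sketch of the closed-form KL term for a diagonal Gaussian:

```python
import math

def kl_to_unit_gaussian(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    # Standard analytic result for diagonal Gaussians vs. the unit Gaussian
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, logvar))
```

The term is zero exactly when the posterior matches the prior (mu = 0, logvar = 0), which is why an unregularized decoder sampled from N(0, I) at evaluation time can fail badly if this term is neglected during training.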

Cluster Etiquette

Please read this document, paying attention to the "Cluster Etiquette" and "Consideration of Others" sections. You may (and should) use both the CSB and CRC clusters to complete this assignment.

Tips

Evaluation

Once you have a trained model you are happy with, you should submit a link to the model to the evaluation server (below). The link must be a world readable Google Storage location (gs://). You can copy your model output to a storage bucket with gsutil:
gsutil acl ch -r -u AllUsers:R gs://bucket-name/
gsutil cp -r model.pth gs://bucket-name/
World-readable http:// URLs should also work (they will be fetched using wget). The model will be evaluated with 1000 random vectors sampled from the latent space using the procedure in eval.py. Note that we sample from a normal distribution with zero mean and unit variance. Make sure you test your model using the same procedure as in this file. Your model will be evaluated with PyTorch 2.0.0 on an NVIDIA TITAN RTX GPU (24 GB).
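To mirror the evaluation locally before submitting, you can draw your test latents the same way. This is a stdlib sketch of the sampling step only; eval.py itself presumably uses a torch equivalent such as `torch.randn`:

```python
import random

LATENT_DIM = 1024  # the required decoder input size

def sample_latents(n, dim=LATENT_DIM, seed=None):
    """Draw n latent vectors i.i.d. from a zero-mean, unit-variance Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
```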

We will compute the following metrics (all out of 1000 samples):
UniqueSMI: the number of unique strings generated
ValidSMI: the number of valid SMILES strings generated (can be parsed by RDKit); these need not be unique
AveRings: the average number of rings in each unique valid molecule (a simple measure of chemical complexity)
UniqueValidMols: the number of valid and unique molecules (distinct SMILES strings that represent the same molecule are only counted once)
NovelMols: the number of valid and unique molecules that are not in the training set
ZeroSMI: the molecule generated from the zero latent vector

Checkpoint

You should have a model that generates at least one valid molecule submitted to the leaderboard by the checkpoint date or you will receive a 10 point penalty.

Grading

Your decoder must generate non-trivial molecular structures (AveRings > 0). The key metric for grading is NovelMols. Your grade will be the percent NovelMols plus 15, with no cap at 100. For example, if you generate 912 NovelMols, your grade will be 91.2 + 15 = 106.2.
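The grading formula, written out (assuming exactly 1000 samples as in the evaluation):

```python
def grade(novel_mols):
    # Percent NovelMols out of 1000 samples, plus 15, with no cap at 100
    return novel_mols / 10 + 15
```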

[Submission form: your @pitt.edu username and a link to your model file, e.g. vae_generate.pth]