Slide1
Chemical-Reaction-Aware Molecule Representation Learning
Hongwei Wang, Weijiang Li, Xiaomeng Jin, Kyunghyun Cho, Heng Ji, Jiawei Han, Martin D. Burke
October 28, 2021
Slide2
Molecule representation
IUPAC nomenclature: 2-hydroxypropanoic acid
Molecular formula: C3H6O3
Structural formula: CH3-CH(OH)-C(=O)-OH
Space filling model
Ball-and-stick model
Slide3
Molecule representation learning
Pipeline: molecule → graph encoder → embedding → downstream tasks
Downstream tasks:
Chemical reaction prediction
Molecule property prediction
Molecule generation
Drug discovery
Retrosynthesis planning
Chemical text mining
Chemical knowledge graph modeling
……
Human-readable representations of molecules are hard for machines to read
The goal of molecule representation learning (MRL) is to map a molecule into a low-dimensional vector space
The learned molecule embeddings can benefit a variety of downstream tasks
Slide4
SMILES
The Simplified Molecular-Input Line-Entry System (SMILES) is used to describe the structure of molecules using short ASCII strings
Structural formula: CH3-CH(OH)-C(=O)-OH
SMILES: CC(O)C(=O)O
SMILES (two equivalent strings for a second, ring-containing example): Oc1ccccc1, OC1=CC=CC=C1
Slide5
SMILES
The Simplified Molecular-Input Line-Entry System (SMILES) is used to describe the structure of molecules using short ASCII strings
Structural formula
Determining the main sequence and branches
Breaking rings
Slide6
SMILES-based MRL methods
SMILES-based MRL methods…
take SMILES strings as input
use language models (BERT, Transformer) as their base models
output hidden layers as molecule embeddings
Illustration of SMILES-Transformer (Honda et al., 2019)
Examples:
MolBERT (Fabian et al., 2020)
ChemBERTa (Chithrananda et al., 2020)
SMILES-Transformer (Honda et al., 2019)
SMILES-BERT (Wang et al., 2019)
Molecule-Transformer (Shin et al., 2019)
SA-BiLSTM (Zheng et al., 2019)
Slide7
Limitation of SMILES-based MRL methods
SMILES strings are 1D linearizations of molecule structures, which makes it hard for a model to learn the original structural information of molecules
CC(CCCCCCCO)=O
These two O’s are close in SMILES string…
…but actually they are far from each other
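The gap between string distance and structural distance can be checked directly. The sketch below is a minimal illustration, assuming a deliberately tiny parser (the hypothetical helper `parse_simple_smiles`) that only handles ring-free SMILES with one-letter atoms; it is not a general SMILES parser.

```python
from collections import deque


def parse_simple_smiles(smiles):
    """Parse a ring-free SMILES string with one-letter atoms.

    Returns a list of (symbol, string_position) and an adjacency dict.
    """
    atoms, adj = [], {}
    stack, prev = [], None
    for pos, ch in enumerate(smiles):
        if ch == "(":
            stack.append(prev)      # remember the branch point
        elif ch == ")":
            prev = stack.pop()      # return to the branch point
        elif ch in "-=#":
            continue                # bond order does not change path length
        elif ch.isalpha() and ch.isupper():
            idx = len(atoms)
            atoms.append((ch, pos))
            adj[idx] = []
            if prev is not None:
                adj[idx].append(prev)
                adj[prev].append(idx)
            prev = idx
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, adj


def bond_distance(adj, src, dst):
    """Number of bonds on the shortest path between two atoms (BFS)."""
    seen, queue = {src: 0}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return seen[node]
        for nxt in adj[node]:
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return None


smiles = "CC(CCCCCCCO)=O"
atoms, adj = parse_simple_smiles(smiles)
oxygens = [i for i, (symbol, _) in enumerate(atoms) if symbol == "O"]
string_gap = atoms[oxygens[1]][1] - atoms[oxygens[0]][1]   # characters apart
graph_gap = bond_distance(adj, oxygens[0], oxygens[1])     # bonds apart
print(string_gap, graph_gap)
```

For this string the two oxygen atoms sit 3 characters apart but 9 bonds apart in the molecular graph.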
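Slide8
Graph neural networks (GNN)
GNNs follow a neighborhood aggregation strategy:
$x_i$: initial feature of node $i$
$h_i^k$: hidden state of node $i$ in layer $k$
$h_i^0 = x_i$ for each node $i$
for $k = 1, \dots, K$:
    $h_i^k = \mathrm{AGGREGATE}\big(\{h_j^{k-1}\}_{j \in \mathcal{N}(i) \cup \{i\}}\big)$ for each node $i$
return $h_i^K$ for each node $i$
Examples:
Weisfeiler-Lehman Network (Jin et al., 2017)
Message Passing Neural Network (Gilmer et al., 2017)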
Slide9
GNN-based MRL methods
1. Propagating messages over the graph
In the $k$-th iteration, aggregate neighborhood information and update the embedding for atom $a_i$:
$h_i^k = \mathrm{AGGREGATE}\big(\{h_j^{k-1}\}_{j \in \mathcal{N}(i) \cup \{i\}}\big)$
2. Read out the molecule graph embedding
After $K$ iterations, use a readout function to aggregate all atom embeddings and return the whole graph embedding:
$h_G = \mathrm{READOUT}\big(\{h_i^K\}_{a_i \in G}\big)$
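The two steps (propagation, then readout) can be sketched in a few lines. This is a simplified illustration only: it assumes plain summation as the aggregation function with no learned weights or nonlinearity, which real GNN layers would add, and the names `propagate` and `readout` are hypothetical.

```python
def propagate(features, neighbors, num_layers):
    """Run `num_layers` rounds of sum aggregation over each node and its neighbors."""
    hidden = {node: list(x) for node, x in features.items()}
    for _ in range(num_layers):
        updated = {}
        for node in hidden:
            # Aggregate the node's own state together with its neighbors' states.
            group = [hidden[node]] + [hidden[u] for u in neighbors[node]]
            updated[node] = [sum(values) for values in zip(*group)]
        hidden = updated
    return hidden


def readout(hidden):
    """Sum all node states into one graph-level embedding."""
    return [sum(values) for values in zip(*hidden.values())]


# Toy path graph a - b - c with 2-dimensional initial features (made up).
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 0.0]}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

hidden = propagate(features, neighbors, num_layers=2)
graph_embedding = readout(hidden)
print(hidden, graph_embedding)
```

After two layers each node's state depends on everything within two hops, which is why the number of layers controls how much of the molecule an atom can "see".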
Slide10
Limitation of GNN-based MRL methods
GNN-based methods are theoretically superior to SMILES-based methods in learning molecule structure, but…
They are limited to designing fresh and delicate GNN architectures while ignoring the essence of MRL, which is generalization ability
There is no specific GNN that performs universally best in all downstream tasks of MRL
New GNN architectures cannot essentially improve the performance of MRL
Slide11
Structural molecule encoder
We use GNNs as the molecule encoder
Each atom has an initial feature vector consisting of four parts:
Element type
Charge
Whether the atom is in an aromatic ring
The count of attached hydrogen atom(s)
No edge feature (i.e., bond type) is considered since…
Bond type can be inferred from the features of its two associated atoms
Bond type does not consistently improve the model performance
Slide12
Structural molecule encoder
Element type: C
Charge: 0
Whether this atom is in an aromatic ring: True
The count of attached hydrogen atom(s): 1
One-hot encoding: element type | charge | aromatic | # H atom(s)
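A minimal sketch of this four-part one-hot encoding follows. The vocabularies (`ELEMENTS`, `CHARGES`, `NUM_HS`) are assumptions chosen for illustration; the slides do not list the actual value sets used.

```python
# Assumed vocabularies for illustration only.
ELEMENTS = ["C", "N", "O", "S"]
CHARGES = [-1, 0, 1]
AROMATIC = [False, True]
NUM_HS = [0, 1, 2, 3, 4]


def one_hot(value, vocabulary):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    if value not in vocabulary:
        raise ValueError(f"{value!r} is not in the vocabulary {vocabulary}")
    return [1 if value == item else 0 for item in vocabulary]


def atom_features(element, charge, aromatic, num_hs):
    """Concatenate the four one-hot parts into one initial feature vector."""
    return (
        one_hot(element, ELEMENTS)
        + one_hot(charge, CHARGES)
        + one_hot(aromatic, AROMATIC)
        + one_hot(num_hs, NUM_HS)
    )


# The atom from the slide: carbon, charge 0, aromatic, one attached hydrogen.
vector = atom_features("C", 0, True, 1)
print(vector)
```

Each of the four parts contributes exactly one non-zero entry, so the vector always sums to 4.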
Slide13
Structural molecule encoder
[Figure: molecular graph of proline with atoms numbered 1 to 8 (heteroatoms N, O, O), shown twice]
Slide14
Preserving chemical reaction equivalence
A chemical reaction defines a particular relation “→” between reactant set $R = \{r_1, r_2, \dots\}$ and product set $P = \{p_1, p_2, \dots\}$: $R \to P$
Several physical quantities remain constant before and after the reaction
Mass, energy, charge, etc.
We aim to preserve such equivalence in the molecule embedding space:
$\sum_{r \in R} h_r = \sum_{p \in P} h_p$
Slide15
Example
C2H5OH + O2 → CH3CHO
[Figures: ethanol (C2H5OH), O2, acetaldehyde (CH3CHO)]
Slide16
Example
Chemical reaction space: C2H5OH + O2 → CH3CHO
Molecule embedding space: $h_{\mathrm{C_2H_5OH}} + h_{\mathrm{O_2}} = h_{\mathrm{CH_3CHO}}$
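As a toy check of the ethanol example, the sketch below measures how far a set of embeddings is from satisfying "sum of reactant embeddings equals sum of product embeddings". The 3-dimensional vectors are made up for illustration and the helper names are hypothetical; real embeddings would come from the trained encoder.

```python
import math


def embedding_sum(embeddings):
    """Element-wise sum of a list of equal-length vectors."""
    return [sum(values) for values in zip(*embeddings)]


def residual_norm(reactants, products):
    """Euclidean distance between the reactant sum and the product sum."""
    return math.dist(embedding_sum(reactants), embedding_sum(products))


# Made-up embeddings chosen so that the equality holds exactly.
ethanol = [1.0, 2.0, 0.5]
oxygen = [0.0, 1.0, 1.5]
acetaldehyde = [1.0, 3.0, 2.0]

gap = residual_norm([ethanol, oxygen], [acetaldehyde])
gap_perturbed = residual_norm([ethanol, oxygen], [[1.0, 3.0, 5.0]])
print(gap, gap_perturbed)
```

A residual of zero means the reaction is perfectly preserved in the embedding space; the perturbed product shows a non-zero residual.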
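Slide17
“→” is an equivalence relation
Proposition 1
$M$ is the set of molecules
$R \subseteq M$ and $P \subseteq M$ are the reactant set and product set of a chemical reaction, respectively
If $R \to P \Leftrightarrow \sum_{r \in R} h_r = \sum_{p \in P} h_p$ for all chemical reactions, then “→” is an equivalence relation on $2^M$ that satisfies the following three properties:
Reflexivity: $A \to A$, for all $A \in 2^M$;
Symmetry: $A \to B \Leftrightarrow B \to A$, for all $A, B \in 2^M$;
Transitivity: if $A \to B$ and $B \to C$, then $A \to C$, for all $A, B, C \in 2^M$.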
Slide18
“→” is an equivalence relation
$2^M$ is naturally split into equivalence classes based on the equivalence relation “→”
For all molecule sets within one equivalence class, the sums of the embeddings of the molecules they consist of should be equal
The equivalence constraint forms a system of linear equations, which imposes a strong constraint on molecule embeddings
As a result, the whole embedding space will be more organized
Example: three molecule sets [omitted] belong to one equivalence class
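A short worked illustration of why the constraint ties whole classes together, writing the reaction relation as $\to$ and the embedding of molecule $m$ as $h_m$. The sets $A$, $B$, and $C$ are generic placeholders, not the sets shown on the slide: if two different reactant sets both lead to the same product set, their embedding sums are forced to coincide.

```latex
\begin{align}
A \to C &\iff \sum_{a \in A} h_a = \sum_{c \in C} h_c \\
B \to C &\iff \sum_{b \in B} h_b = \sum_{c \in C} h_c \\
\Rightarrow \quad \sum_{a \in A} h_a &= \sum_{b \in B} h_b
\end{align}
```

Each observed reaction therefore contributes one linear equation, and reactions that share molecule sets chain these equations together.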
Slide19
Reaction template
The reaction center of $R \to P$ is defined as an induced subgraph of reactants $R$, in which each atom has at least one bond whose type differs from $R$ to $P$
“no bond” is also seen as a bond type
[Figure: example reaction between propionic acid and propanol, with water as a by-product]
Slide20
Reaction template
Proposition 2
$R \to P$ is a chemical reaction where $R$ is the reactant set and $P$ is the product set, and $C$ is its reaction center
Suppose that the number of GNN layers is $K$, and the READOUT function is summation
Then for an arbitrary atom $a$ in one of the reactants whose final representation is $h_a^K$, the residual term $\sum_{r \in R} h_r - \sum_{p \in P} h_p$ is a function of $h_a^K$ if and only if the distance between atom $a$ and reaction center $C$ is less than $K$
[Figure: example reaction between propionic acid and propanol, with water as a by-product]
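Slide21
Reaction template
If the embedding equality holds for the example reaction, then it will also hold for any functional groups $R_1$ and $R_2$ in the reaction template
[Figure: the example reaction between propionic acid and propanol (with water as a by-product) and the corresponding reaction template with functional groups R1 and R2]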
Slide22
Reaction template
The learned reaction templates are essential to improving generalization ability
The model can easily apply this knowledge to reactions that are unseen in training data but comply with a known reaction template
[Figure: the example reaction between propionic acid and propanol (with water as a by-product) and the corresponding reaction template with functional groups R1 and R2]
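Slide23
Training the model
Simply minimizing $\big\lVert \sum_{r \in R} h_r - \sum_{p \in P} h_p \big\rVert_2$ doesn’t work, since mapping every molecule to the zero vector trivially minimizes it
[Figure: reactants and products fed to the GNN encoder]
Minibatch-based contrastive loss for a minibatch of data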
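The loss formula itself did not survive on this slide, so the sketch below is an assumption rather than the authors' exact objective: it uses a common margin-based contrastive form in which matched reactant/product pairs are pulled together and mismatched pairs within the minibatch are pushed at least `margin` apart. The function name, the averaging, and the margin value are all hypothetical.

```python
import math


def contrastive_loss(reactant_sums, product_sums, margin=4.0):
    """Margin-based contrastive loss over one minibatch (assumed form).

    reactant_sums[i] and product_sums[i] are the summed embeddings of the
    reactants and products of reaction i.
    """
    n = len(reactant_sums)
    # Positive term: matched pairs should be close.
    positive = sum(
        math.dist(reactant_sums[i], product_sums[i]) for i in range(n)
    ) / n
    # Negative term: mismatched pairs should be at least `margin` apart.
    negatives = [
        max(margin - math.dist(reactant_sums[i], product_sums[j]), 0.0)
        for i in range(n)
        for j in range(n)
        if i != j
    ]
    negative = sum(negatives) / len(negatives) if negatives else 0.0
    return positive + negative


# Well-separated, perfectly matched reactions: zero loss.
loss_aligned = contrastive_loss(
    [[0.0, 0.0], [3.0, 4.0]], [[0.0, 0.0], [3.0, 4.0]], margin=4.0
)
# Collapsed (all-zero) embeddings: penalized by the negative term.
loss_collapsed = contrastive_loss(
    [[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], margin=4.0
)
print(loss_aligned, loss_collapsed)
```

Under this assumed form the all-zero solution is no longer optimal, because mismatched pairs that coincide incur the full margin as a penalty.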
Slide24
Experiments: Chemical reaction prediction
Dataset
USPTO-479k
Training set: 409k, validation set: 30k, test set: 40k
Each reaction contains SMILES strings of up to five reactant(s) and exactly one product
Format:
    reactant_smiles    product_smiles
0   CC(C)C[Mg+].CON(C)C(=O)c1ccc(O)nc1    CC(C)CC(=O)c1ccc(O)nc1
1   CN.O=C(O)c1ccc(Cl)c([N+](=O)[O-])c1    CNc1ccc(C(=O)O)cc1[N+](=O)[O-]
2   CCn1cc(C(=O)O)c(=O)c2cc(F)c(-c3ccc(N)cc3)cc21.O=CO    CCn1cc(C(=O)O)c(=O)c2cc(F)c(-c3ccc(NC=O)cc3)cc21
……
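A record in this format can be split into its molecules with plain string handling, since "." separates the reactants inside `reactant_smiles`. The helper name `parse_reaction` and its validation rules are assumptions based on the dataset description above (at most five reactants, exactly one product).

```python
def parse_reaction(reactant_smiles, product_smiles):
    """Split one dataset row into a list of reactant SMILES and one product SMILES."""
    reactants = reactant_smiles.split(".")
    if not 1 <= len(reactants) <= 5:
        raise ValueError(f"expected 1 to 5 reactants, got {len(reactants)}")
    if "." in product_smiles:
        raise ValueError("expected exactly one product")
    return reactants, product_smiles


# Row 1 of the example above.
reactants, product = parse_reaction(
    "CN.O=C(O)c1ccc(Cl)c([N+](=O)[O-])c1",
    "CNc1ccc(C(=O)O)cc1[N+](=O)[O-]",
)
print(reactants, product)
```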
Slide25
Experiments: Chemical reaction prediction
Evaluation Protocol
Query reactant(s) → trained model → query embedding
Candidate products → trained model → [candidate embedding_1, candidate embedding_2, …]
Calculating Euclidean distance and ranking candidates
Evaluation metrics: MRR, MR, Hit@1, 3, 5, 10
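The ranking step and the metrics can be sketched as follows. The embeddings are toy 2-dimensional vectors, the function names are hypothetical, and ties are resolved optimistically (only strictly closer candidates count against the true product), which is an assumption since the slides do not specify tie handling.

```python
import math


def rank_of_true(query, candidates, true_index):
    """Rank (1 = best) of the true product among candidates by Euclidean distance."""
    distances = [math.dist(query, candidate) for candidate in candidates]
    # Count candidates that are strictly closer to the query than the true product.
    return 1 + sum(d < distances[true_index] for d in distances)


def ranking_metrics(ranks, ks=(1, 3, 5, 10)):
    """Mean reciprocal rank, mean rank, and Hit@k over a list of ranks."""
    n = len(ranks)
    metrics = {"MRR": sum(1 / r for r in ranks) / n, "MR": sum(ranks) / n}
    for k in ks:
        metrics[f"Hit@{k}"] = sum(r <= k for r in ranks) / n
    return metrics


candidates = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
ranks = [
    rank_of_true([0.9, 0.0], candidates, true_index=1),  # true product is nearest
    rank_of_true([0.2, 0.0], candidates, true_index=1),  # one candidate is nearer
]
metrics = ranking_metrics(ranks)
print(ranks, metrics)
```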
Slide26
Experiments: Chemical reaction prediction
Baselines
Mol2vec (Jaeger et al., 2018)
MolBERT (Fabian et al., 2020)
Mol2vec-FT1 and MolBERT-FT1:
Freeze the model parameters but train a diagonal matrix to rank candidates
MolBERT-FT2:
Finetunes its model parameters by minimizing the contrastive loss
Mol2vec is not an end-to-end model and thus cannot be finetuned using this strategy
Slide27
Experiments: Chemical reaction prediction
Using products in the test set as candidates
Slide28
Experiments: Chemical reaction prediction
Using products in the test set as candidates
Slide29
Experiments: Chemical reaction prediction
Using products in the test set as candidates
Slide30
Experiments: Chemical reaction prediction
Using products in the test set as candidates
Slide31
Experiments: Chemical reaction prediction
Sensitivity of MolR-GCN to # GNN layers, dimension of embedding, [symbol missing], and batch size
Slide32
Experiments: Chemical reaction prediction
Result on Real Multi-choice Questions on Product Prediction
We select 16 multi-choice questions on product prediction collected from online resources of Oxford University Press, mhpraticeplus.com, and GRE Chemistry Test Practice Book
Example question:
Slide33
Experiments: Chemical reaction prediction
Case Study on USPTO-479k
Slide34
Experiments: Molecule property prediction
All molecules in a dataset → pretrained model → [(embedding_1, label_1), (embedding_2, label_2), …] → classification model
train : validation : test = 8:1:1
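The 8:1:1 split of (embedding, label) pairs can be sketched as below. The shuffling, the fixed seed, and the function name `split_811` are assumptions; the slides only state the ratio, not how the split is drawn.

```python
import random


def split_811(items, seed=0):
    """Shuffle items and split them into train / validation / test at 8:1:1."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Stand-ins for (embedding, label) pairs; real ones come from the pretrained model.
pairs = [([float(i), float(i) / 2], i % 2) for i in range(20)]
train, valid, test = split_811(pairs)
print(len(train), len(valid), len(test))
```

The classification model is then fitted on `train`, tuned on `valid`, and scored on `test`, with the pretrained encoder left untouched.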
Slide35
Experiments: Molecule property prediction
Experimental results on AUC
The results in the first three blocks are taken from (Honda et al., 2019), (Chithrananda et al., 2020), and (Fabian et al., 2020), respectively, while the results in the last two blocks are reported by us.
Slide36
Experiments: Graph-edit-distance prediction
Graph edit distance (GED) is a measure of similarity between two graphs, defined as the minimum number of graph edit operations needed to transform one graph into another
The allowed graph edit operations include: insert/delete an isolated node, insert/delete an edge, change the feature of a node/edge
Calculating exact GED is NP-hard.
[Figure: one graph is turned into another by four edits (delete an edge, delete a node, change a node, change an edge), so GED = 4]
Slide37
Experiments: Graph-edit-distance prediction
[(mol_1, mol_2, GED), …], for example molecule pairs with GED 10 and GED 22
→ Pretrained model → [(embedding_1, embedding_2, GED), …]
→ Concatenating or subtracting two embeddings as features → [(feature, GED), …]
→ Regression model
train : validation : test = 8:1:1
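The feature-construction step, turning a pair of molecule embeddings into one regression input, can be sketched as follows. The function name `pair_feature` and the `mode` argument are hypothetical; the two modes correspond to the "concatenating or subtracting" options named on the slide.

```python
def pair_feature(emb_a, emb_b, mode="concat"):
    """Build one feature vector from two molecule embeddings."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    if mode == "concat":
        return list(emb_a) + list(emb_b)
    if mode == "subtract":
        return [a - b for a, b in zip(emb_a, emb_b)]
    raise ValueError(f"unknown mode: {mode!r}")


# Toy embeddings and a made-up GED label for one molecule pair.
embedding_1, embedding_2, ged = [1.0, 2.0, 3.0], [0.5, 2.0, 5.0], 10
concat_example = (pair_feature(embedding_1, embedding_2, "concat"), ged)
subtract_example = (pair_feature(embedding_1, embedding_2, "subtract"), ged)
print(concat_example, subtract_example)
```

Concatenation keeps both embeddings intact at twice the dimension, while subtraction keeps the dimension fixed and exposes only the difference between the two molecules.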
Slide38
Experiments: Graph-edit-distance prediction
Experimental results on RMSE
The purpose of the experiment is to show whether the learned embeddings are able to preserve the structural similarity between molecules in the embedding space
Randomly sampling 10k pairs from the first 1k molecules in QM9 dataset
Using networkx.algorithms.similarity.graph_edit_distance() to calculate the ground truth
Interval of ground-truth GEDs: [1, 14]
Slide39
Experiments: Embedding visualization
Reaction-awareness
Alcohol oxidation: R-CH2OH + O2 → R-CHO + H2O
Aldehyde oxidation: R-CHO + O2 → R-COOH
Slide40
Embedding visualization
Molecule property p_np (permeable or not) for BBBP dataset
Communities of non-permeable molecules
Slide41
Embedding visualization
GED w.r.t. molecule #1196 in BBBP dataset
Molecule #1196 in BBBP
Molecules that are structurally similar to #1196 are also close to it in embedding space
Molecules that are structurally dissimilar to #1196 are also far from it in embedding space
Slide42
Embedding visualization
Molecule sizes (# non-hydrogen atoms) for BBBP dataset
Size increasing
Small-molecule region
Large-molecule region
Slide43
Embedding visualization
# smallest rings for BBBP dataset
# rings increasing
Multi-ring molecule region
No-ring molecule region
Slide44
Takeaways
We use GNNs as the molecule encoder, and use chemical reactions to assist learning molecule representations
The sum of reactant embeddings and the sum of product embeddings are forced to be equal
We prove that our model is able to learn reaction templates that are essential to improving the generalization ability
Our model is shown to benefit a wide range of downstream tasks
The visualized results demonstrate that the learned embeddings are well-organized and reaction-aware
Slide45
Q & A
Thanks!