Chemical-Reaction-Aware Molecule Representation Learning

Uploaded On 2022-06-01


Slide1

Chemical-Reaction-Aware Molecule Representation Learning

Hongwei Wang, Weijiang Li, Xiaomeng Jin, Kyunghyun Cho, Heng Ji, Jiawei Han, Martin D. Burke

October 28, 2021

Slide2

Molecule representation

IUPAC nomenclature: 2-hydroxypropanoic acid

Molecular formula: C3H6O3

Structural formula: CH3-CH(OH)-C(=O)-OH

Space filling model

Ball-and-stick model

 

Slide3

Molecule representation learning

molecule → graph encoder → embedding → downstream tasks

Downstream tasks:

Chemical reaction prediction

Molecule property prediction

Molecule generation

Drug discovery

Retrosynthesis planning

Chemical text mining

Chemical knowledge graph modeling

……

Human-readable representations of molecules are hard for machines to read

The goal of molecule representation learning (MRL) is to map a molecule into a low-dimensional vector space

The learned molecule embeddings can benefit a variety of downstream tasks

Slide4

SMILES

The Simplified Molecular-Input Line-Entry System (SMILES) is used to describe the structure of molecules using short ASCII strings

Structural formula: CH3-CH(OH)-C(=O)-OH

SMILES: CC(O)C(=O)O

Structural formula: [Figure: ring structure]

SMILES: Oc1ccccc1 or OC1=CC=CC=C1

Slide5

SMILES

The Simplified Molecular-Input Line-Entry System (SMILES) is used to describe the structure of molecules using short ASCII strings

Structural formula

Determining the main sequence and branches

Breaking rings

Slide6

SMILES-based MRL methods

SMILES-based MRL methods…

take SMILES strings as input

use language models (BERT, Transformer) as their base models

output hidden layers as molecule embeddings

Illustration of SMILES-Transformer (Honda et al., 2019)

Examples:

MolBERT (Fabian et al., 2020)

ChemBERTa (Chithrananda et al., 2020)

SMILES-Transformer (Honda et al., 2019)

SMILES-BERT (Wang et al., 2019)

Molecule-Transformer (Shin et al., 2019)

SA-BiLSTM (Zheng et al., 2019)

Slide7

Limitation of SMILES-based MRL methods

SMILES strings are 1D linearizations of molecule structures, which makes it hard to learn the original structural information of molecules from them

CC(CCCCCCCO)=O

These two O’s are close in the SMILES string…

…but actually they are far from each other in the molecule
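The gap between string distance and graph distance can be checked directly on the SMILES string from this slide. The sketch below is a minimal illustration, not part of the presented method: `parse_simple_smiles` is a hypothetical helper that only handles ring-free SMILES made of single-letter atoms, branches, and bond symbols.

```python
from collections import deque


def parse_simple_smiles(smiles):
    """Parse a ring-free SMILES with single-letter atoms, branches and bond symbols.

    Returns (atoms, adjacency, positions); positions[i] is the string index of atom i.
    """
    atoms, positions, adjacency = [], [], {}
    branch_stack = []  # atoms to return to when a branch closes
    previous = None    # atom that the next atom will bond to
    for index, char in enumerate(smiles):
        if char == "(":
            branch_stack.append(previous)
        elif char == ")":
            previous = branch_stack.pop()
        elif char in "-=#":
            continue  # bond order does not change graph distance
        elif char.isalpha() and char.isupper():
            atom = len(atoms)
            atoms.append(char)
            positions.append(index)
            adjacency[atom] = []
            if previous is not None:
                adjacency[atom].append(previous)
                adjacency[previous].append(atom)
            previous = atom
        else:
            raise ValueError(f"unsupported SMILES character: {char!r}")
    return atoms, adjacency, positions


def graph_distance(adjacency, source, target):
    """Breadth-first search for the number of bonds between two atoms."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distances[node]
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    raise ValueError("atoms are not connected")


atoms, adjacency, positions = parse_simple_smiles("CC(CCCCCCCO)=O")
first_o, second_o = [i for i, symbol in enumerate(atoms) if symbol == "O"]
string_gap = positions[second_o] - positions[first_o]
bond_gap = graph_distance(adjacency, first_o, second_o)
print(string_gap, bond_gap)  # 3 characters apart, 9 bonds apart
```

A sequence model sees the two oxygen atoms three characters apart, while a graph encoder sees them nine bonds apart.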

Slide8

Graph neural networks (GNN)

 

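GNNs follow a neighborhood aggregation strategy:

$x_i$: initial feature of node $i$

$h_i^k$: hidden state of node $i$ in layer $k$

$h_i^0 = x_i$ for each node $i$

for $k = 1, \dots, K$:

for each node $i$: $h_i^k = \text{AGGREGATE}\left(\{h_j^{k-1}\}_{j \in N(i)} \cup \{h_i^{k-1}\}\right)$

return $h_i^K$ for each node $i$

Examples:

Weisfeiler-Lehman Network (Jin et al., 2017)

Message Passing Neural Network (Gilmer et al., 2017)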

Slide9

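GNN-based MRL methods

1. Propagating messages over the graph

In the $k$-th iteration, aggregate neighborhood information and update the embedding of atom $a_i$:

$h_i^k = \text{AGGREGATE}\left(\{h_j^{k-1}\}_{j \in N(i)} \cup \{h_i^{k-1}\}\right), \quad k = 1, \dots, K$

2. Reading out the molecule graph embedding

After $K$ iterations, use a readout function to aggregate all atom embeddings and return the whole graph embedding:

$h_G = \text{READOUT}\left(\{h_i^K\}_{i \in V}\right)$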
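The two steps above (message propagation, then readout) can be sketched in a few lines. This is a simplified illustration under stated assumptions: aggregation is a plain sum over a node and its neighbors with no learned weights or nonlinearity, the readout is a sum, and the function names are hypothetical.

```python
def message_passing(features, adjacency, num_layers):
    """Run num_layers rounds of sum aggregation over each node and its neighbors."""
    hidden = {node: list(vector) for node, vector in features.items()}
    for _ in range(num_layers):
        updated = {}
        for node, vector in hidden.items():
            total = list(vector)  # include the node's own previous state
            for neighbor in adjacency[node]:
                total = [a + b for a, b in zip(total, hidden[neighbor])]
            updated[node] = total
        hidden = updated  # all nodes are updated from the previous layer's states
    return hidden


def readout_sum(hidden):
    """Sum all node states into one graph-level embedding."""
    dimension = len(next(iter(hidden.values())))
    return [sum(vector[d] for vector in hidden.values()) for d in range(dimension)]


# Toy path graph 0 - 1 - 2 with hand-made 2-d node features.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 0.0]}
adjacency = {0: [1], 1: [0, 2], 2: [1]}

hidden = message_passing(features, adjacency, num_layers=1)
graph_embedding = readout_sum(hidden)
print(hidden[1], graph_embedding)  # [2.0, 1.0] [4.0, 3.0]
```

Real GNN layers insert learned transformations inside the aggregation step; the data flow stays the same.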

 

Slide10

Limitation of GNN-based MRL methods

GNN-based methods are theoretically superior to SMILES-based methods in learning molecule structure, but…

They are limited to designing fresh and delicate GNN architectures while ignoring the essence of MRL, which is generalization ability

There is no specific GNN that performs universally best in all downstream tasks of MRL

New GNN architectures cannot essentially improve the performance of MRL

Slide11

Structural molecule encoder

We use GNNs as the molecule encoder

Each atom has an initial feature vector consisting of four parts:

Element type

Charge

Whether the atom is in an aromatic ring

The count of attached hydrogen atom(s)

No edge feature (i.e., bond type) is considered since…

Bond type can be inferred from the features of its two associated atoms

Bond type does not consistently improve the model performance

Slide12

Structural molecule encoder

Element type: C

Charge: 0

Whether this atom is in an aromatic ring: True

The count of attached hydrogen atom(s): 1

One-hot encoding of the four parts: element type, charge, aromatic, # H atom(s)
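A minimal sketch of this four-part one-hot encoding follows. The vocabularies (`ELEMENTS`, `CHARGES`, `HYDROGEN_COUNTS`) are assumptions chosen for illustration; the slides do not list the actual value sets.

```python
# Hypothetical vocabularies; the real value sets are not given in the slides.
ELEMENTS = ["C", "N", "O", "H"]
CHARGES = [-1, 0, 1]
AROMATIC = [False, True]
HYDROGEN_COUNTS = [0, 1, 2, 3, 4]


def one_hot(value, vocabulary):
    """Return a 0/1 list with a single 1 at the position of value."""
    if value not in vocabulary:
        raise ValueError(f"{value!r} is not in {vocabulary!r}")
    return [1 if value == item else 0 for item in vocabulary]


def atom_features(element, charge, aromatic, num_hydrogens):
    """Concatenate the four one-hot parts into one initial feature vector."""
    return (
        one_hot(element, ELEMENTS)
        + one_hot(charge, CHARGES)
        + one_hot(aromatic, AROMATIC)
        + one_hot(num_hydrogens, HYDROGEN_COUNTS)
    )


# The atom from the slide: carbon, charge 0, aromatic, one attached hydrogen.
features = atom_features("C", 0, True, 1)
print(features)  # [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
```

Each part contributes exactly one 1, so every atom vector sums to four.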

Slide13

Structural molecule encoder

[Figure: molecular graph of proline with atoms numbered 1 to 8 (one N and two O atoms labeled), shown twice to illustrate the per-layer update]

Slide14

Preserving chemical reaction equivalence

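A chemical reaction defines a particular relation “$\rightarrow$” between reactant set $R = \{r_1, r_2, \dots\}$ and product set $P = \{p_1, p_2, \dots\}$:

$r_1 + r_2 + \dots \rightarrow p_1 + p_2 + \dots$

Several physical quantities remain constant before and after the reaction

Mass, energy, charge, etc.

We aim to preserve such equivalence in the molecule embedding space:

$\sum_{r \in R} h_r = \sum_{p \in P} h_p$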

Slide15

Example

 

 

 

C2H5OH + O2 → CH3CHO

Ethanol: C2H5OH

O2

Acetaldehyde: CH3CHO

Slide16

Example

Chemical reaction space: C2H5OH + O2 → CH3CHO

Molecule embedding space: $h_{\text{C2H5OH}} + h_{\text{O2}} = h_{\text{CH3CHO}}$
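To make the embedding-space reading of this example concrete, the sketch below measures how far a reaction is from satisfying "sum of reactant embeddings equals sum of product embeddings". The 2-d vectors are hand-picked for illustration and are not learned values; `reaction_residual` is a hypothetical helper name.

```python
import math


def sum_vectors(vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


def reaction_residual(reactant_embeddings, product_embeddings):
    """Euclidean norm of (sum of reactant embeddings - sum of product embeddings)."""
    return math.dist(sum_vectors(reactant_embeddings), sum_vectors(product_embeddings))


# Toy embeddings chosen so that the reaction from the slide holds exactly.
embedding = {
    "C2H5OH": [1.0, 2.0],
    "O2": [0.5, -1.0],
    "CH3CHO": [1.5, 1.0],
}

residual = reaction_residual(
    [embedding["C2H5OH"], embedding["O2"]],
    [embedding["CH3CHO"]],
)
print(residual)  # 0.0 for these hand-picked vectors
```

A trained encoder would only approximate a zero residual; the same function can be used to measure how well a given reaction is preserved.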

 

Slide17

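“→” is an equivalence relation

Proposition 1

$M$ is the set of molecules

$R \subseteq M$ and $P \subseteq M$ are the reactant set and product set of a chemical reaction, respectively

If $R \rightarrow P \Leftrightarrow \sum_{r \in R} h_r = \sum_{p \in P} h_p$ for all chemical reactions, then “$\rightarrow$” is an equivalence relation on $2^M$ that satisfies the following three properties:

Reflexivity: $A \rightarrow A$, for all $A \in 2^M$;

Symmetry: $A \rightarrow B \Leftrightarrow B \rightarrow A$, for all $A, B \in 2^M$;

Transitivity: if $A \rightarrow B$ and $B \rightarrow C$, then $A \rightarrow C$, for all $A, B, C \in 2^M$.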
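A short sketch of why the three properties follow, assuming $h_m$ denotes the embedding of molecule $m$ and $A \rightarrow B$ is read as equality of the summed embeddings of the molecule sets $A$ and $B$. Each property is inherited directly from the corresponding property of vector equality:

```latex
\begin{align*}
\text{Reflexivity:}\quad
  & \sum_{a \in A} h_a = \sum_{a \in A} h_a
    \;\Rightarrow\; A \rightarrow A \\
\text{Symmetry:}\quad
  & \sum_{a \in A} h_a = \sum_{b \in B} h_b
    \;\Leftrightarrow\; \sum_{b \in B} h_b = \sum_{a \in A} h_a \\
\text{Transitivity:}\quad
  & \sum_{a \in A} h_a = \sum_{b \in B} h_b
    \ \text{and}\ \sum_{b \in B} h_b = \sum_{c \in C} h_c
    \;\Rightarrow\; \sum_{a \in A} h_a = \sum_{c \in C} h_c
\end{align*}
```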

Slide18

“→” is an equivalence relation

$2^M$ is naturally split into equivalence classes based on the equivalence relation “$\rightarrow$”

For all molecule sets within one equivalence class, the sum of embeddings of all molecules they consist of should be equal

The equivalence constraint forms a system of linear equations, which imposes a strong constraint on molecule embeddings

As a result, the whole embedding space will be more organized

Example: [three molecule sets shown on the slide] belong to one equivalence class

 

Slide19

Reaction template

The reaction center of $R \rightarrow P$ is defined as an induced subgraph of reactants $R$, in which each atom has at least one bond whose type differs from $R$ to $P$

“No bond” is also seen as a bond type

[Figure: molecular graphs labeled, in order, propionic acid, propanol, propionic acid, water]
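The reaction-center definition can be sketched as a comparison of bond tables before and after the reaction. The atom numbering and bond tables below are a hypothetical atom-mapped encoding of an acid plus alcohol condensation (heavy atoms only), written for illustration rather than taken from the slides.

```python
def bond(atom_a, atom_b):
    """Order-independent key for the bond between two mapped atoms."""
    return frozenset((atom_a, atom_b))


def reaction_center(reactant_bonds, product_bonds):
    """Atoms with at least one bond whose type differs between the two sides.

    A pair missing from a table counts as the bond type "none".
    """
    center = set()
    for pair in set(reactant_bonds) | set(product_bonds):
        before = reactant_bonds.get(pair, "none")
        after = product_bonds.get(pair, "none")
        if before != after:
            center.update(pair)
    return center


# Hypothetical mapping: acid = atoms 1-5 (C1-C2-C3(=O4)-O5), alcohol = atoms 6-9 (C6-C7-C8-O9).
reactant_bonds = {
    bond(1, 2): "single", bond(2, 3): "single",
    bond(3, 4): "double", bond(3, 5): "single",
    bond(6, 7): "single", bond(7, 8): "single", bond(8, 9): "single",
}
# Assumed product side: the C3-O5 bond breaks (O5 leaves as water), a C3-O9 bond forms.
product_bonds = dict(reactant_bonds)
del product_bonds[bond(3, 5)]
product_bonds[bond(3, 9)] = "single"

center = reaction_center(reactant_bonds, product_bonds)
print(sorted(center))  # [3, 5, 9]
```

Treating a missing bond as the type "none" is what lets bond formation and bond breaking be detected by the same comparison.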

Slide20

Reaction template

Proposition 2

$R \rightarrow P$ is a chemical reaction where $R$ is the reactant set and $P$ is the product set, and $C$ is its reaction center

Suppose that the number of GNN layers is $K$, and the READOUT function is summation

Then for an arbitrary atom $a$ in one of the reactants whose final representation is $h_a^K$, the residual term $\sum_{r \in R} h_r - \sum_{p \in P} h_p$ is a function of $h_a^K$ if and only if the distance between atom $a$ and reaction center $C$ is less than $K$

[Figure: molecular graphs labeled, in order, propionic acid, propanol, propionic acid, water]
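The distance condition in Proposition 2 can be explored with a multi-source breadth-first search: starting from the reaction-center atoms, collect every atom closer than the number of GNN layers. The graph and center below are a hypothetical heavy-atom encoding of an acid (atoms 1 to 5) and an alcohol (atoms 6 to 9), assumed for illustration only.

```python
from collections import deque


def atoms_near_center(adjacency, center, num_layers):
    """Return atoms whose bond distance to the nearest center atom is < num_layers."""
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    distance = {atom: 0 for atom in center}
    queue = deque(center)
    while queue:
        atom = queue.popleft()
        if distance[atom] + 1 >= num_layers:
            continue  # neighbors would sit at distance >= num_layers
        for neighbor in adjacency[atom]:
            if neighbor not in distance:
                distance[neighbor] = distance[atom] + 1
                queue.append(neighbor)
    return set(distance)


# Hypothetical reactant graph: C1-C2-C3(=O4)-O5 and C6-C7-C8-O9.
adjacency = {
    1: [2], 2: [1, 3], 3: [2, 4, 5], 4: [3], 5: [3],
    6: [7], 7: [6, 8], 8: [7, 9], 9: [8],
}
center = {3, 5, 9}  # assumed reaction center for this toy example

affected_two_layers = atoms_near_center(adjacency, center, num_layers=2)
affected_three_layers = atoms_near_center(adjacency, center, num_layers=3)
print(sorted(affected_two_layers))    # [2, 3, 4, 5, 8, 9]
print(sorted(affected_three_layers))  # [1, 2, 3, 4, 5, 7, 8, 9]
```

Atoms outside the returned set do not influence the residual term under the proposition, which is why they can be swapped for arbitrary groups in a reaction template.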

 

Slide21

Reaction template

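If the reaction shown in the figure holds, then the corresponding reaction template will also hold for any functional groups $R_1$ and $R_2$

[Figure: molecular graphs labeled, in order, propionic acid, propanol, propionic acid, water]

Reaction template: [Figure: the same molecular graphs with substituents $R_1$ and $R_2$]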

 

Slide22

Reaction template

The learned reaction templates are essential to improving generalization ability

The model can easily apply this knowledge to reactions that are unseen in training data but comply with a known reaction template

[Figure: molecular graphs labeled, in order, propionic acid, propanol, propionic acid, water]

Reaction template: [Figure: the same molecular graphs with substituents $R_1$ and $R_2$]

 

Slide23

Training the model

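Simply minimizing $\left\| \sum_{r \in R} h_r - \sum_{p \in P} h_p \right\|_2$ doesn’t work, because all-zero embeddings already achieve the minimum

[Diagram: reactants and products are both passed through the GNN encoder]

Minibatch-based contrastive loss for a minibatch of data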
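The loss formula itself did not survive in this transcript. The sketch below shows one common margin-based form of a minibatch contrastive loss, given purely as an assumption: matched reactant and product sums are pulled together, and mismatched pairs within the minibatch are pushed at least `margin` apart. The function name and the exact normalization are hypothetical.

```python
import math


def contrastive_loss(reactant_sums, product_sums, margin):
    """Margin-based contrastive loss over one minibatch (assumed form).

    reactant_sums[i] and product_sums[i] are the summed embeddings of the
    reactants and products of reaction i.
    """
    size = len(reactant_sums)
    # Positive term: distance between each reaction's own reactant and product sums.
    positive = sum(
        math.dist(reactant_sums[i], product_sums[i]) for i in range(size)
    ) / size
    # Negative term: hinge penalty when a mismatched pair is closer than the margin.
    negative = 0.0
    for i in range(size):
        for j in range(size):
            if i != j:
                gap = math.dist(reactant_sums[i], product_sums[j])
                negative += max(margin - gap, 0.0)
    if size > 1:
        negative /= size * (size - 1)
    return positive + negative


# Two toy reactions whose reactant and product sums match exactly.
reactant_sums = [[0.0, 0.0], [3.0, 4.0]]
product_sums = [[0.0, 0.0], [3.0, 4.0]]

print(contrastive_loss(reactant_sums, product_sums, margin=4.0))  # 0.0
print(contrastive_loss(reactant_sums, product_sums, margin=6.0))  # 1.0
```

The negative term is what rules out the degenerate all-zero solution: with every embedding at zero, mismatched pairs would sit at distance 0 and pay the full margin.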

 

Slide24

Experiments: Chemical reaction prediction

Dataset: USPTO-479k

Training set: 409k, validation set: 30k, test set: 40k

Each reaction contains SMILES strings of up to five reactant(s) and exactly one product

Format:

 | reactant_smiles | product_smiles

0 | CC(C)C[Mg+].CON(C)C(=O)c1ccc(O)nc1 | CC(C)CC(=O)c1ccc(O)nc1

1 | CN.O=C(O)c1ccc(Cl)c([N+](=O)[O-])c1 | CNc1ccc(C(=O)O)cc1[N+](=O)[O-]

2 | CCn1cc(C(=O)O)c(=O)c2cc(F)c(-c3ccc(N)cc3)cc21.O=CO | CCn1cc(C(=O)O)c(=O)c2cc(F)c(-c3ccc(NC=O)cc3)cc21

……
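In SMILES, a dot separates disconnected molecules, so a `reactant_smiles` field can be split into individual reactants. The sketch below assumes each row is available as a dictionary with the two column names shown above; how the file is actually stored and loaded is not stated in the slides, and `parse_reaction` is a hypothetical helper.

```python
def split_molecules(smiles_field):
    """Split a dot-separated SMILES field into one SMILES string per molecule."""
    return [part for part in smiles_field.split(".") if part]


def parse_reaction(row, max_reactants=5):
    """Return (reactants, product) and check the dataset's stated constraints."""
    reactants = split_molecules(row["reactant_smiles"])
    products = split_molecules(row["product_smiles"])
    if not 1 <= len(reactants) <= max_reactants:
        raise ValueError(f"expected 1 to {max_reactants} reactants, got {len(reactants)}")
    if len(products) != 1:
        raise ValueError(f"expected exactly one product, got {len(products)}")
    return reactants, products[0]


# First row of the format example above.
row = {
    "reactant_smiles": "CC(C)C[Mg+].CON(C)C(=O)c1ccc(O)nc1",
    "product_smiles": "CC(C)CC(=O)c1ccc(O)nc1",
}
reactants, product = parse_reaction(row)
print(reactants)  # ['CC(C)C[Mg+]', 'CON(C)C(=O)c1ccc(O)nc1']
print(product)    # CC(C)CC(=O)c1ccc(O)nc1
```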

Slide25

Experiments: Chemical reaction prediction

Evaluation protocol

Query reactant(s) → trained model → query embedding

Candidate products → trained model → [candidate embedding_1, candidate embedding_2, …]

Calculating Euclidean distance between the query embedding and each candidate embedding, then ranking candidates

Evaluation metrics: MRR, MR, Hit@1, 3, 5, 10
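The ranking protocol and its metrics can be sketched as follows, assuming embeddings are plain lists of floats. The tie-handling rule (candidates at an equal distance do not push the true product down) and the function names are assumptions made for illustration.

```python
import math


def rank_of_truth(query, candidates, truth_index):
    """1-based rank of the true product when candidates are sorted by Euclidean distance."""
    distances = [math.dist(query, candidate) for candidate in candidates]
    truth_distance = distances[truth_index]
    return 1 + sum(1 for d in distances if d < truth_distance)


def summarize(ranks, ks=(1, 3, 5, 10)):
    """Mean reciprocal rank (MRR), mean rank (MR), and Hit@k over a list of ranks."""
    count = len(ranks)
    metrics = {
        "MRR": sum(1.0 / rank for rank in ranks) / count,
        "MR": sum(ranks) / count,
    }
    for k in ks:
        metrics[f"Hit@{k}"] = sum(1 for rank in ranks if rank <= k) / count
    return metrics


# One toy query with three candidates; the true product is candidate 2.
example_rank = rank_of_truth([0.0, 0.0], [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]], truth_index=2)
print(example_rank)  # 2

# Metrics over three hypothetical queries whose true products ranked 1st, 2nd and 4th.
metrics = summarize([1, 2, 4])
print(metrics)
```

Higher MRR and Hit@k are better, while a lower MR is better.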

Slide26

Experiments: Chemical reaction prediction

Baselines

Mol2vec (Jaeger et al. 2018)

MolBERT (Fabian et al. 2020)

Mol2vec-FT1 and MolBERT-FT1: Freeze the model parameters but train a diagonal matrix to rank candidates

MolBERT-FT2: Finetunes its model parameters by minimizing the contrastive loss

Mol2vec is not an end-to-end model and thus cannot be finetuned using this strategy

Slide27

Experiments: Chemical reaction prediction

Using products in the test set as candidates

Slide28

Experiments: Chemical reaction prediction

Using products in the test set as candidates

Slide29

Experiments: Chemical reaction prediction

Using products in the test set as candidates

Slide30

Experiments: Chemical reaction prediction

Using products in the test set as candidates

Slide31

Experiments: Chemical reaction prediction

Sensitivity of MolR-GCN on # GNN layers, dimension of embedding, margin $\gamma$, and batch size

 

Slide32

Experiments: Chemical reaction prediction

Result on Real Multi-choice Questions on Product Prediction

We select 16 multi-choice questions on product prediction collected from online resources of Oxford University Press, mhpraticeplus.com, and GRE Chemistry Test Practice Book

Example question:

Slide33

Experiments: Chemical reaction prediction

Case Study on USPTO-479k

Slide34

Experiments: molecule property prediction

All molecules in a dataset → pretrained model → [(embedding_1, label_1), (embedding_2, label_2), …]

train : validation : test = 8:1:1

Classification model
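An 8:1:1 split of (embedding, label) pairs can be sketched as below. Whether the original experiments shuffle randomly or split in some other way is not stated in the slides; a seeded random shuffle is assumed here, and `split_8_1_1` is a hypothetical helper.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle items with a fixed seed and split them into train/validation/test = 8:1:1."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test


# Placeholder (embedding, label) pairs; real embeddings would come from the pretrained model.
pairs = [(f"embedding_{i}", i % 2) for i in range(20)]
train, valid, test = split_8_1_1(pairs)
print(len(train), len(valid), len(test))  # 16 2 2
```

The classifier is then fit on `train`, tuned on `valid`, and scored on `test`, with the pretrained encoder left unchanged.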

Slide35

Experiments: molecule property prediction

Experimental results on AUC

The results in the first three blocks are taken from (Honda et al., 2019), (Chithrananda et al., 2020), and (Fabian et al., 2020), respectively, while the results in the last two blocks are reported by us.

Slide36

Experiments: Graph-edit-distance prediction

Graph edit distance (GED) is a measure of similarity between two graphs, which measures the minimum number of graph edit operations to transform one graph to another

The allowed graph edit operations include: insert/delete an isolated node, insert/delete an edge, change the feature of a node/edge

Calculating exact GED is NP-hard.

[Figure: one graph is transformed into another by four edits (delete an edge, delete a node, change a node, change an edge), so GED = 4]

Slide37

Experiments: Graph-edit-distance prediction

[(mol_1, mol_2, GED), …] → pretrained model → [(embedding_1, embedding_2, GED), …]

Example input (molecule images omitted): [( · , · , 10), ( · , · , 22), …]

Concatenating or subtracting two embeddings as features: [(feature, GED), …]

Regression model

train : validation : test = 8:1:1
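The feature-construction step (concatenating or subtracting two embeddings) can be sketched as below, assuming embeddings are plain lists of floats. The function name and the `mode` argument are hypothetical.

```python
def pair_features(embedding_a, embedding_b, mode="concat"):
    """Build one regression feature vector from a pair of molecule embeddings."""
    if mode == "concat":
        return list(embedding_a) + list(embedding_b)
    if mode == "subtract":
        if len(embedding_a) != len(embedding_b):
            raise ValueError("embeddings must have the same dimension")
        return [a - b for a, b in zip(embedding_a, embedding_b)]
    raise ValueError(f"unknown mode: {mode!r}")


# Toy 2-d embeddings for one molecule pair.
concatenated = pair_features([1.0, 2.0], [0.5, 0.5], mode="concat")
difference = pair_features([1.0, 2.0], [0.5, 0.5], mode="subtract")
print(concatenated)  # [1.0, 2.0, 0.5, 0.5]
print(difference)    # [0.5, 1.5]
```

Concatenation keeps both embeddings intact at twice the dimension, while subtraction keeps the dimension but is sensitive to the order of the pair.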

Slide38

Experiments: Graph-edit-distance prediction

Experimental results on RMSE

The purpose of the experiment is to show whether the learned embeddings are able to preserve the structural similarity between molecules in the embedding space

Randomly sampling 10k pairs from the first 1k molecules in the QM9 dataset

Using networkx.algorithms.similarity.graph_edit_distance() to calculate the ground truth

Interval of ground-truth GEDs: [1, 14]

Slide39

Experiments: Embedding visualization

Reaction-awareness

Alcohol oxidation: R-CH2OH + O2 → R-CHO + H2O

Aldehyde oxidation: R-CHO + O2 → R-COOH

 

Slide40

Embedding visualization

Molecule property p_np (permeable or not) for BBBP dataset

Communities of non-permeable molecules

Slide41

Embedding visualization

GED w.r.t. molecule #1196 in the BBBP dataset

Molecule #1196 in BBBP

Molecules that are structurally similar to #1196 are also close to it in embedding space

Molecules that are structurally dissimilar to #1196 are also far from it in embedding space

Slide42

Embedding visualization

Molecule sizes (# non-hydrogen atoms) for BBBP dataset

Size increasing

Small-molecule region

Large-molecule region

Slide43

Embedding visualization

# smallest rings for BBBP dataset

# rings increasing

Multi-ring molecule region

No-ring molecule region

Slide44

Takeaways

We use GNNs as the molecule encoder, and use chemical reactions to assist learning molecule representations

The sum of reactant embeddings and the sum of product embeddings are forced to be equal

We prove that our model is able to learn reaction templates that are essential to improve the generalization ability

Our model is shown to benefit a wide range of downstream tasks

The visualized results demonstrate that the learned embeddings are well-organized and reaction-aware

Slide45

Q & A

Thanks!