Learning generative models for molecules

Drug discovery is a well-known and challenging problem. Typically, one needs to navigate a vast chemical space, estimated at up to \( 10^{60} \) small organic molecules, to find a potential drug candidate with the desired properties. Such a trial-and-error process is often cumbersome and error-prone. However, molecule-related data is abundant, and data-driven approaches are a promising way to expedite the molecule design stage, which is a pivotal part of the chemical development pipeline.

Machine learning has revolutionized many fields, including computer vision and natural language processing. Despite this success, machine learning-based algorithms are proficient mainly at solving the forward problem, in which a specific input structure has a unique associated property that we want to predict. In contrast, generating a new molecule that meets specific property requirements is an inverse problem, in which there is a one-to-many mapping from the required property values to structurally diverse molecules. Learning this mapping is highly challenging and is the fundamental difficulty of molecule-related inverse problems. Furthermore, molecules are naturally represented as graphs, and directly optimizing over this discrete representation to find a molecule with the target property is a known hard problem.

In this project, we aim to leverage generative models, a class of methods that has proven effective both for this problem and for inverse modeling more generally. The project has two overarching objectives:

  1. to develop single-step generation of a molecule given some desired properties;
  2. to perform style transfer between molecules.

The latter objective seeks to turn a given molecule with known properties into a new molecule by slightly adjusting its structure so that it exhibits some desired characteristics. Previous studies solved these problems with a two-stage process that includes molecule property optimization in the second stage. However, this additional stage is computationally prohibitive and requires retraining for each different target value at inference time, which poses significant challenges.

Another limitation of such property optimization is that it must operate in a continuous latent space, where optimization is feasible. Operating in continuous latent spaces can be problematic for discrete-data learning problems, because it can destroy the semantic structure of the molecule. To enable single-step generation, we need training paradigms that are no longer constrained by this extra optimization step. Different training paradigms lead to different choices of generative model, such as variational autoencoders, normalizing flows, or, more recently, diffusion-based models. The last two choices are relatively under-explored for molecule generation, so our focus will be on exploiting the expressive power of these recent generative models for learning discrete-structure data.

Conditional generation and style transfer are promising directions for achieving faster property-targeted molecule generation with higher precision than the two-stage approaches. However, if not carefully designed, such approaches lead to pathological cases in which the conditioning factors carry little information about the generated structure and are often ignored by the model. Thus, a specific aim of this project is to design appropriate objective functions that ensure all model components are used as intended, avoiding reductions to trivial cases.

Finally, the models developed in this project are expected to be of direct interest to other domains where inverse problems arise, including robotics, physical process modeling, and beyond. The project is fully funded through a Swiss National Science Foundation grant.

In this project, we will utilise expertise and insights gained in our previous, related project SMELL.