Graph Diffusion Transformers are In-Context Molecular Designers
October 9, 2025
Authors: Gang Liu, Jie Chen, Yihan Zhu, Michael Sun, Tengfei Luo, Nitesh V Chawla, Meng Jiang
cs.AI
Abstract
In-context learning allows large models to adapt to new tasks from a few
demonstrations, but it has shown limited success in molecular design. Existing
databases such as ChEMBL contain molecular properties spanning millions of
biological assays, yet labeled data for each property remain scarce. To address
this limitation, we introduce demonstration-conditioned diffusion models
(DemoDiff), which define task contexts using a small set of molecule-score
examples instead of text descriptions. These demonstrations guide a denoising
Transformer to generate molecules aligned with target properties. For scalable
pretraining, we develop a new molecular tokenizer with Node Pair Encoding that
represents molecules at the motif level, requiring 5.5× fewer nodes. We
curate a dataset containing millions of context tasks from multiple sources
covering both drugs and materials, and pretrain a 0.7-billion-parameter model
on it. Across 33 design tasks in six categories, DemoDiff matches or surpasses
language models 100-1000× larger and achieves an average rank of 3.63
compared to 5.25-10.20 for domain-specific approaches. These results position
DemoDiff as a molecular foundation model for in-context molecular design. Our
code is available at https://github.com/liugangcode/DemoDiff.
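The abstract describes Node Pair Encoding as a tokenizer that compresses molecular graphs to the motif level, in the spirit of byte-pair encoding: repeatedly find the most frequent pair of adjacent node labels and contract matching edges into a single motif node. The Python sketch below illustrates that idea under stated assumptions: graphs are represented as a (node-label dict, edge set) pair, and merges greedily contract non-overlapping matching edges. All names and the merge rule are illustrative; the paper's actual tokenizer lives in the linked repository and may differ.

```python
# Minimal BPE-style "node pair encoding" sketch on graphs (illustrative only;
# not DemoDiff's actual tokenizer). A graph is (nodes, edges) where
# nodes: dict[node_id -> label] and edges: set of (u, v) tuples with u < v.
from collections import Counter

def most_frequent_pair(graphs):
    """Return the most common unordered label pair over all edges, or None."""
    counts = Counter()
    for nodes, edges in graphs:
        for u, v in sorted(edges):  # sorted for deterministic tie-breaking
            counts[tuple(sorted((nodes[u], nodes[v])))] += 1
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(nodes, edges, pair):
    """Greedily contract non-overlapping edges whose endpoint labels == pair."""
    nodes, mapping, used = dict(nodes), {}, set()
    for u, v in sorted(edges):
        if u in used or v in used:
            continue  # each node participates in at most one merge per round
        if tuple(sorted((nodes[u], nodes[v]))) == pair:
            nodes[u] = f"({pair[0]}.{pair[1]})"  # new motif label
            del nodes[v]
            mapping[v] = u
            used.update((u, v))
    rewired = set()
    for u, v in edges:
        u, v = mapping.get(u, u), mapping.get(v, v)
        if u != v:  # drop self-loops created by contraction
            rewired.add((min(u, v), max(u, v)))
    return nodes, rewired

def learn_vocab(graphs, num_merges):
    """Learn up to `num_merges` motif merges from a corpus (BPE-style training)."""
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(graphs)
        if pair is None:
            break
        merges.append(pair)
        graphs = [merge_pair(n, e, pair) for n, e in graphs]
    return merges, graphs

# Toy example: ethanol as the path C-C-O; two merges leave a single motif node.
ethanol = ({0: "C", 1: "C", 2: "O"}, {(0, 1), (1, 2)})
merges, tokenized = learn_vocab([ethanol], num_merges=2)
print(merges)     # [('C', 'C'), ('(C.C)', 'O')]
print(tokenized)  # [({0: '((C.C).O)'}, set())] -- three atoms became one motif
```

As with text BPE, frequent substructures learned this way become single tokens, which is how a motif-level vocabulary can shrink a molecular graph by the roughly 5.5× factor the abstract reports.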