Graph Diffusion Transformers are In-Context Molecular Designers
October 9, 2025
Authors: Gang Liu, Jie Chen, Yihan Zhu, Michael Sun, Tengfei Luo, Nitesh V Chawla, Meng Jiang
cs.AI
Abstract
In-context learning allows large models to adapt to new tasks from a few
demonstrations, but it has shown limited success in molecular design. Existing
databases such as ChEMBL contain molecular properties spanning millions of
biological assays, yet labeled data for each property remain scarce. To address
this limitation, we introduce demonstration-conditioned diffusion models
(DemoDiff), which define task contexts using a small set of molecule-score
examples instead of text descriptions. These demonstrations guide a denoising
Transformer to generate molecules aligned with target properties. For scalable
pretraining, we develop a new molecular tokenizer with Node Pair Encoding that
represents molecules at the motif level, requiring 5.5× fewer nodes. We
curate a dataset containing millions of context tasks from multiple sources
covering both drugs and materials, and pretrain a 0.7-billion-parameter model
on it. Across 33 design tasks in six categories, DemoDiff matches or surpasses
language models 100-1000× larger and achieves an average rank of 3.63
compared to 5.25-10.20 for domain-specific approaches. These results position
DemoDiff as a molecular foundation model for in-context molecular design. Our
code is available at https://github.com/liugangcode/DemoDiff.
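
To make the in-context interface concrete, here is a minimal usage sketch: a task is specified purely by a handful of molecule-score demonstrations, with no text description of the property. The class and method names, SMILES strings, and scores below are illustrative assumptions, not the repo's actual API; see the GitHub link above for the real implementation.

```python
from dataclasses import dataclass

@dataclass
class Demo:
    smiles: str   # molecule, e.g. as a SMILES string
    score: float  # measured property value for that molecule

# The task "prompt" is a few molecule-score pairs; no textual
# description of the target property is given to the model.
# (Scores here are made-up illustrative numbers.)
demos = [
    Demo("CCO", 0.91),
    Demo("c1ccccc1O", 0.72),
    Demo("CC(=O)Nc1ccc(O)cc1", 0.15),
]

# Hypothetical calls -- the class and method names are assumptions,
# not the repo's actual API:
# model = DemoDiff.from_pretrained(...)
# candidates = model.generate(context=demos, num_samples=8)
```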
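The name Node Pair Encoding suggests a byte-pair-encoding-style procedure lifted from character sequences to graphs: repeatedly contract the most frequent labeled node pair into a single motif node, shrinking the node count. The following is a toy sketch of that idea under those assumptions; the graph representation, merge rule, and function name are illustrative, not the paper's algorithm.

```python
from collections import Counter

def node_pair_encoding(graphs, num_merges):
    """BPE-style motif merging on labeled graphs (toy sketch).

    Each graph is (labels, edges): labels maps node id -> label string,
    edges is a set of frozenset({u, v}) node-id pairs. Each round, the
    most frequent label pair is contracted into a single motif node,
    analogous to byte-pair encoding on text.
    """
    vocab = []
    for _ in range(num_merges):
        counts = Counter()
        for labels, edges in graphs:
            for e in edges:
                u, v = tuple(e)
                counts[tuple(sorted((labels[u], labels[v])))] += 1
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        motif = f"({pair[0]}|{pair[1]})"
        vocab.append(motif)
        merged = []
        for labels, edges in graphs:
            labels, edges = dict(labels), set(edges)
            match = True
            while match:
                match = False
                for e in list(edges):
                    u, v = tuple(e)
                    if tuple(sorted((labels[u], labels[v]))) != pair:
                        continue
                    # Contract edge (u, v): u becomes the motif node,
                    # v's remaining edges are rerouted to u.
                    labels[u] = motif
                    edges.discard(e)
                    for f in [f for f in edges if v in f]:
                        (w,) = f - {v}
                        edges.discard(f)
                        if w != u:
                            edges.add(frozenset((u, w)))
                    del labels[v]
                    match = True
                    break
            merged.append((labels, edges))
        graphs = merged
    return vocab, graphs

# Example: a C-C-C-O chain; the C-C pair is most frequent, so one
# merge produces the motif (C|C) and a smaller 3-node graph.
g = ({0: "C", 1: "C", 2: "C", 3: "O"},
     {frozenset((0, 1)), frozenset((1, 2)), frozenset((2, 3))})
vocab, out = node_pair_encoding([g], num_merges=1)
print(vocab)  # ['(C|C)']
print(out)    # contracted graph; surviving node ids depend on merge order
```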