Graph Diffusion Transformers are In-Context Molecular Designers

Abstract

In-context learning allows large models to adapt to new tasks from a few demonstrations, but it has shown limited success in molecular design. Existing databases such as ChEMBL contain molecular properties spanning millions of biological assays, yet labeled data for each property remain scarce. To address this limitation, we introduce demonstration-conditioned diffusion models (DemoDiff), which define task contexts using a small set of molecule-score examples instead of text descriptions. These demonstrations guide a denoising Transformer to generate molecules aligned with target properties. For scalable pretraining, we develop a new molecular tokenizer with Node Pair Encoding that represents molecules at the motif level, requiring 5.5 times fewer nodes. We curate a dataset containing millions of context tasks from multiple sources covering both drugs and materials, and pretrain a 0.7-billion-parameter model on it. Across 33 design tasks in six categories, DemoDiff matches or surpasses language models 100-1000 times larger and achieves an average rank of 3.63, compared to 5.25-10.20 for domain-specific approaches. These results position DemoDiff as a molecular foundation model for in-context molecular design. Our code is available at https://github.com/liugangcode/DemoDiff.
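
The abstract does not spell out how Node Pair Encoding works. As an illustration only, the sketch below assumes it behaves like byte pair encoding applied to graphs: find the most frequent pair of adjacent node labels and contract matching edges into a single motif node. The function names (`most_frequent_pair`, `merge_pair`), the label format, and the greedy non-overlapping merge rule are all hypothetical and are not taken from the DemoDiff code.

```python
from collections import Counter


def most_frequent_pair(labels, edges):
    """Return the most common unordered pair of node labels joined by an edge."""
    counts = Counter(tuple(sorted((labels[u], labels[v]))) for u, v in edges)
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(labels, edges, pair):
    """Contract non-overlapping edges whose endpoint labels match `pair`.

    Assumption: a greedy left-to-right merge, as in byte pair encoding.
    """
    merged_into = {}  # removed node -> surviving node
    used = set()      # nodes already consumed by a merge in this pass
    for u, v in edges:
        if u in used or v in used:
            continue
        if tuple(sorted((labels[u], labels[v]))) == pair:
            used.update((u, v))
            merged_into[v] = u

    # Surviving nodes keep their labels; merged nodes get a motif label.
    new_labels = {n: lab for n, lab in labels.items() if n not in merged_into}
    for survivor in merged_into.values():
        new_labels[survivor] = "(" + pair[0] + pair[1] + ")"

    # Redirect edges to surviving nodes and drop the contracted ones.
    def rep(n):
        return merged_into.get(n, n)

    new_edges = sorted(
        {tuple(sorted((rep(u), rep(v)))) for u, v in edges if rep(u) != rep(v)}
    )
    return new_labels, new_edges


# Toy chain C-C-C-C-O (heavy atoms only); node ids are arbitrary integers.
labels = {0: "C", 1: "C", 2: "C", 3: "C", 4: "O"}
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]

pair = most_frequent_pair(labels, edges)
new_labels, new_edges = merge_pair(labels, edges, pair)
print(pair, new_labels, new_edges)
```

On this toy chain a single merge pass reduces five atom nodes to three motif-level nodes. Repeating the step with a learned merge vocabulary is one plausible way a tokenizer could shrink graphs, but the toy example does not attempt to reproduce the 5.5 times reduction reported above.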
