Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery

Abstract

Large language models (LLMs), especially explicit long chain-of-thought (CoT) reasoning models such as DeepSeek-R1 and QWQ, have demonstrated powerful reasoning capabilities, achieving impressive performance in commonsense reasoning and mathematical inference. Despite their effectiveness, long-CoT reasoning models are often criticized for their limited ability and low efficiency in knowledge-intensive domains such as molecule discovery. Success in this field requires a precise understanding of domain knowledge, including molecular structures and chemical principles, which is challenging because of the inherent complexity of molecular data and the scarcity of high-quality expert annotations. To bridge this gap, we introduce Mol-R1, a novel framework designed to improve the explainability and reasoning performance of R1-like explicit long-CoT reasoning LLMs in text-based molecule generation. Our approach begins with a high-quality reasoning dataset curated through Prior Regulation via In-context Distillation (PRID), a dedicated distillation strategy that generates paired reasoning traces guided by prior regulations. Building on this dataset, we introduce MoIA (Molecular Iterative Adaptation), a training strategy that iteratively combines Supervised Fine-tuning (SFT) with Reinforced Policy Optimization (RPO) and is tailored to boost the reasoning performance of R1-like reasoning models for molecule discovery. Finally, we evaluate Mol-R1 on the text-based molecule reasoning generation task, where it shows superior performance against existing baselines.
