GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition

Abstract

Optical Chemical Structure Recognition (OCSR) is crucial for digitizing chemical knowledge by converting molecular images into machine-readable formats. While recent vision-language models (VLMs) have shown potential in this task, their image-captioning approach often struggles with complex molecular structures and inconsistent annotations. To overcome these challenges, we introduce GTR-Mol-VLM, a novel framework featuring two key innovations: (1) the Graph Traversal as Visual Chain of Thought mechanism, which emulates human reasoning by incrementally parsing molecular graphs through sequential atom-bond predictions, and (2) the data-centric principle of Faithfully Recognize What You've Seen, which addresses the mismatch between abbreviated structures in images and their expanded annotations. To support model development, we constructed GTR-CoT-1.3M, a large-scale instruction-tuning dataset with meticulously corrected annotations, and introduced MolRec-Bench, the first benchmark designed for fine-grained evaluation of graph-parsing accuracy in OCSR. Comprehensive experiments demonstrate that GTR-Mol-VLM achieves superior results compared to specialist models, chemistry-domain VLMs, and commercial general-purpose VLMs. Notably, in scenarios involving molecular images with functional group abbreviations, GTR-Mol-VLM outperforms the second-best baseline by approximately 14 percentage points on both SMILES-based and graph-based metrics. We hope that this work will drive OCSR technology to more effectively meet real-world needs, thereby advancing the fields of cheminformatics and AI for Science. We will release GTR-CoT at https://github.com/opendatalab/GTR-CoT.
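
The idea of parsing a molecular graph through sequential atom-bond predictions can be illustrated with a small sketch. This is not the authors' implementation or output format: the function name `traverse_molecule`, the tuple-based step format, and the choice of depth-first order are all assumptions made for illustration. The example keeps abbreviations such as `Ph` and `OMe` as single nodes, which is one plausible reading of the "Faithfully Recognize What You've Seen" principle.

```python
def traverse_molecule(atoms, bonds, start=0):
    """Walk a molecular graph depth-first and emit atom and bond steps.

    Hypothetical illustration only: the step format and traversal order
    are assumptions, not the format used by GTR-Mol-VLM.

    atoms: dict mapping atom index -> label as drawn (e.g. "C", "Ph")
    bonds: list of (i, j, order) tuples
    Only the connected component containing `start` is visited.
    """
    # Build an undirected adjacency list.
    neighbours = {i: [] for i in atoms}
    for i, j, order in bonds:
        neighbours[i].append((j, order))
        neighbours[j].append((i, order))

    steps = []
    visited = set()
    stack = [start]
    while stack:
        current = stack.pop()
        if current in visited:
            continue
        visited.add(current)
        # Step 1: predict the atom (or abbreviation) at this node.
        steps.append(("atom", current, atoms[current]))
        ordered = sorted(neighbours[current])
        # Step 2: predict bonds back to atoms that were already emitted.
        # Each bond is emitted once, when its second endpoint is visited,
        # so ring closures are covered as well.
        for other, order in ordered:
            if other in visited:
                steps.append(("bond", other, current, order))
        # Queue unvisited neighbours so the lowest index is visited next.
        for other, _ in reversed(ordered):
            if other not in visited:
                stack.append(other)
    return steps


# Methyl benzoate drawn with abbreviations: Ph-C(=O)-OMe.
# "Ph" and "OMe" stay as single nodes, exactly as they appear in the image.
example_atoms = {0: "Ph", 1: "C", 2: "O", 3: "OMe"}
example_bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]
for step in traverse_molecule(example_atoms, example_bonds):
    print(step)
```

Emitting each bond only when its second endpoint is reached keeps the sequence incremental: every step refers solely to atoms that have already been produced.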
