GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition

June 9, 2025
Authors: Jingchao Wang, Haote Yang, Jiang Wu, Yifan He, Xingjian Wei, Yinfan Wang, Chengjin Liu, Lingli Ge, Lijun Wu, Bin Wang, Dahua Lin, Conghui He
cs.AI

Abstract

Optical Chemical Structure Recognition (OCSR) is crucial for digitizing chemical knowledge because it converts molecular images into machine-readable formats. Recent vision-language models (VLMs) have shown potential on this task, but their image-captioning approach often struggles with complex molecular structures and inconsistent annotations. To overcome these challenges, we introduce GTR-Mol-VLM, a novel framework featuring two key innovations: (1) the Graph Traversal as Visual Chain of Thought mechanism, which emulates human reasoning by incrementally parsing molecular graphs through sequential atom-bond predictions, and (2) the data-centric principle of "Faithfully Recognize What You've Seen", which addresses the mismatch between abbreviated structures in images and their expanded annotations. To support model development, we constructed GTR-CoT-1.3M, a large-scale instruction-tuning dataset with meticulously corrected annotations, and introduced MolRec-Bench, the first benchmark designed for fine-grained evaluation of graph-parsing accuracy in OCSR. Comprehensive experiments demonstrate that GTR-Mol-VLM achieves superior results compared to specialist models, chemistry-domain VLMs, and commercial general-purpose VLMs. Notably, on molecular images containing functional group abbreviations, GTR-Mol-VLM outperforms the second-best baseline by approximately 14 percentage points on both SMILES-based and graph-based metrics. We hope that this work will drive OCSR technology to meet real-world needs more effectively, thereby advancing the fields of cheminformatics and AI for Science. We will release GTR-CoT at https://github.com/opendatalab/GTR-CoT.
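
The idea of parsing a molecular graph through sequential atom-bond predictions can be illustrated with a small sketch. The abstract does not specify the traversal order or the output format used by GTR-Mol-VLM, so everything below is an assumption made for illustration: the function name `traverse_molecule`, the depth-first order, and the tuple-based step format are all hypothetical. The example molecule keeps the abbreviation "Et" as a single node rather than expanding it, in the spirit of "Faithfully Recognize What You've Seen".

```python
from collections import defaultdict


def traverse_molecule(atoms, bonds, start=0):
    """Walk a molecular graph depth-first, emitting atoms and bonds in visit order.

    Hypothetical illustration only; not the authors' implementation.
    atoms: list of labels (element symbols or abbreviations), indexed by atom id
    bonds: list of (i, j, order) tuples
    """
    adjacency = defaultdict(list)
    for i, j, order in bonds:
        adjacency[i].append((j, order))
        adjacency[j].append((i, order))

    steps = []
    visited_atoms = set()
    emitted_bonds = set()

    def visit(atom_id):
        visited_atoms.add(atom_id)
        steps.append(("atom", atom_id, atoms[atom_id]))
        for neighbor, order in sorted(adjacency[atom_id]):
            bond_key = frozenset((atom_id, neighbor))
            if bond_key in emitted_bonds:
                continue  # each bond is emitted exactly once
            emitted_bonds.add(bond_key)
            steps.append(("bond", atom_id, neighbor, order))
            if neighbor not in visited_atoms:
                visit(neighbor)  # ring closures hit an already-visited atom

    visit(start)
    return steps


# Assumed toy input: ethyl acetate drawn with an "Et" abbreviation,
# kept as one node instead of being expanded to two carbons.
atoms = ["C", "C", "O", "O", "Et"]
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1), (3, 4, 1)]
steps = traverse_molecule(atoms, bonds)

for step in steps:
    print(step)
```

Each emitted step is either an atom or a bond, so a prediction can be checked element by element against the graph rather than only as a final SMILES string.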