GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition

Abstract

Optical Chemical Structure Recognition (OCSR) is crucial for digitizing chemical knowledge: it converts molecular images into machine-readable formats. While recent vision-language models (VLMs) have shown potential in this task, their image-captioning approach often struggles with complex molecular structures and inconsistent annotations. To overcome these challenges, we introduce GTR-Mol-VLM, a novel framework featuring two key innovations: (1) the Graph Traversal as Visual Chain of Thought mechanism, which emulates human reasoning by incrementally parsing molecular graphs through sequential atom-bond predictions, and (2) the data-centric principle of "Faithfully Recognize What You've Seen", which addresses the mismatch between abbreviated structures in images and their expanded annotations. To support model development, we constructed GTR-CoT-1.3M, a large-scale instruction-tuning dataset with meticulously corrected annotations, and introduced MolRec-Bench, the first benchmark designed for fine-grained evaluation of graph-parsing accuracy in OCSR. Comprehensive experiments demonstrate that GTR-Mol-VLM achieves superior results compared to specialist models, chemistry-domain VLMs, and commercial general-purpose VLMs. Notably, in scenarios involving molecular images with functional group abbreviations, GTR-Mol-VLM outperforms the second-best baseline by approximately 14 percentage points in both SMILES-based and graph-based metrics. We hope that this work will drive OCSR technology to more effectively meet real-world needs, thereby advancing the fields of cheminformatics and AI for Science. We will release GTR-CoT at https://github.com/opendatalab/GTR-CoT.
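
To make the idea of sequential atom-bond prediction concrete, here is a minimal Python sketch of walking a molecular graph and emitting one step per atom and one step per bond. It is an illustration only: the abstract does not specify the traversal order or the output format used by GTR-Mol-VLM, so the depth-first order, the tuple layout, and the names `traverse_molecule`, `atoms`, and `bonds` are all assumptions. The example molecule (ethyl acetate drawn with an `Et` abbreviation) keeps `Et` as a single node rather than expanding it, in the spirit of recognizing what is actually drawn.

```python
from collections import defaultdict


def traverse_molecule(atoms, bonds, start=0):
    """Depth-first walk over a molecular graph (hypothetical format).

    atoms: list of labels as drawn, e.g. "C", "O", or an abbreviation "Et".
    bonds: list of (atom_index, atom_index, bond_type) tuples.
    Returns a list of steps: ("atom", index, label) when an atom is first
    reached, and ("bond", i, j, bond_type) once both endpoints are known.
    Only atoms connected to `start` are visited.
    """
    neighbors = defaultdict(dict)
    for a, b, bond_type in bonds:
        neighbors[a][b] = bond_type
        neighbors[b][a] = bond_type

    steps, visited, stack = [], set(), [start]
    while stack:
        current = stack.pop()
        if current in visited:
            continue
        visited.add(current)
        steps.append(("atom", current, atoms[current]))

        # Emit every bond that links this atom to an already visited atom.
        # Each bond is therefore emitted exactly once, including ring closures.
        for other in sorted(neighbors[current]):
            if other in visited and other != current:
                steps.append(("bond", other, current, neighbors[current][other]))

        # Push unvisited neighbors in reverse so the lowest index is popped first.
        for other in sorted(neighbors[current], reverse=True):
            if other not in visited:
                stack.append(other)
    return steps


# Ethyl acetate as it might be drawn with an abbreviated ethyl group:
# CH3-C(=O)-O-Et. The "Et" label is kept as one node, not expanded to C-C.
atoms = ["C", "C", "O", "O", "Et"]
bonds = [(0, 1, "single"), (1, 2, "double"), (1, 3, "single"), (3, 4, "single")]

steps = traverse_molecule(atoms, bonds)
for step in steps:
    print(step)
```

Emitting a bond only when its second endpoint is reached keeps the sequence free of duplicates and handles rings without special cases; a real system would also need to carry coordinates, charges, and stereochemistry, which this sketch leaves out.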
