GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition
June 9, 2025
Authors: Jingchao Wang, Haote Yang, Jiang Wu, Yifan He, Xingjian Wei, Yinfan Wang, Chengjin Liu, Lingli Ge, Lijun Wu, Bin Wang, Dahua Lin, Conghui He
cs.AI
Abstract
Optical Chemical Structure Recognition (OCSR) is crucial for digitizing
chemical knowledge by converting molecular images into machine-readable
formats. While recent vision-language models (VLMs) have shown potential in
this task, their image-captioning approach often struggles with complex
molecular structures and inconsistent annotations. To overcome these
challenges, we introduce GTR-Mol-VLM, a novel framework featuring two key
innovations: (1) the Graph Traversal as Visual Chain of Thought
mechanism that emulates human reasoning by incrementally parsing molecular
graphs through sequential atom-bond predictions, and (2) the data-centric
principle of Faithfully Recognize What You've Seen, which addresses
the mismatch between abbreviated structures in images and their expanded
annotations. To support model development, we constructed GTR-CoT-1.3M, a
large-scale instruction-tuning dataset with meticulously corrected annotations,
and introduced MolRec-Bench, the first benchmark designed for a fine-grained
evaluation of graph-parsing accuracy in OCSR. Comprehensive experiments
demonstrate that GTR-Mol-VLM achieves superior results compared to specialist
models, chemistry-domain VLMs, and commercial general-purpose VLMs. Notably, in
scenarios involving molecular images with functional group abbreviations,
GTR-Mol-VLM outperforms the second-best baseline by approximately 14 percentage
points, both in SMILES-based and graph-based metrics. We hope that this work
will drive OCSR technology to more effectively meet real-world needs, thereby
advancing the fields of cheminformatics and AI for Science. We will release
GTR-CoT at https://github.com/opendatalab/GTR-CoT.
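
To make the idea behind Graph Traversal as Visual Chain of Thought more concrete, the sketch below serializes a molecular graph into sequential atom-bond steps via a breadth-first traversal. It is a minimal illustration only: it relies on RDKit, and the step-string format and function names are assumptions for exposition, not the paper's actual chain-of-thought serialization.

```python
# Illustrative sketch (not the paper's actual format): walk a molecular graph
# breadth-first and emit sequential atom/bond "prediction" steps, mimicking
# the incremental atom-bond parsing described for GTR-Mol-VLM.
from collections import deque

from rdkit import Chem


def traversal_steps(smiles: str) -> list[str]:
    """Return atom/bond steps produced by a BFS over the molecular graph."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")

    visited = {0}                 # atom indices already emitted
    seen_edges = set()            # bonds already emitted (as sorted index pairs)
    queue = deque([0])            # start the traversal at atom index 0
    steps = [f"atom {mol.GetAtomWithIdx(0).GetSymbol()}0"]

    while queue:
        idx = queue.popleft()
        for nbr in mol.GetAtomWithIdx(idx).GetNeighbors():
            j = nbr.GetIdx()
            edge = tuple(sorted((idx, j)))
            if edge in seen_edges:
                continue
            seen_edges.add(edge)
            bond = mol.GetBondBetweenAtoms(idx, j)
            if j not in visited:
                # Tree edge: predict the bond, then the newly reached atom.
                visited.add(j)
                queue.append(j)
                steps.append(f"bond {bond.GetBondType()} {idx}-{j}")
                steps.append(f"atom {nbr.GetSymbol()}{j}")
            else:
                # Ring closure: both endpoints were already visited.
                steps.append(f"ring-bond {bond.GetBondType()} {idx}-{j}")
    return steps


if __name__ == "__main__":
    # Benzene: six aromatic atoms, five tree bonds, and one ring-closing bond.
    for step in traversal_steps("c1ccccc1"):
        print(step)
```

Such a step sequence (atoms and bonds emitted one at a time) is one way to turn a graph into the kind of incremental visual reasoning trace the abstract describes; the paper's GTR-CoT-1.3M dataset defines its own annotation scheme.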