GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition

Abstract

Optical Chemical Structure Recognition (OCSR) is crucial for digitizing chemical knowledge by converting molecular images into machine-readable formats. While recent vision-language models (VLMs) have shown potential in this task, their image-captioning approach often struggles with complex molecular structures and inconsistent annotations. To overcome these challenges, we introduce GTR-Mol-VLM, a novel framework featuring two key innovations: (1) the Graph Traversal as Visual Chain of Thought mechanism, which emulates human reasoning by incrementally parsing molecular graphs through sequential atom-bond predictions, and (2) the data-centric principle of Faithfully Recognize What You've Seen, which addresses the mismatch between abbreviated structures in images and their expanded annotations. To support model development, we constructed GTR-CoT-1.3M, a large-scale instruction-tuning dataset with meticulously corrected annotations, and introduced MolRec-Bench, the first benchmark designed for fine-grained evaluation of graph-parsing accuracy in OCSR. Comprehensive experiments demonstrate that GTR-Mol-VLM achieves superior results compared to specialist models, chemistry-domain VLMs, and commercial general-purpose VLMs. Notably, in scenarios involving molecular images with functional group abbreviations, GTR-Mol-VLM outperforms the second-best baseline by approximately 14 percentage points on both SMILES-based and graph-based metrics. We hope that this work will drive OCSR technology to more effectively meet real-world needs, thereby advancing the fields of cheminformatics and AI for Science. We will release GTR-CoT at https://github.com/opendatalab/GTR-CoT.
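
To make the idea of sequential atom-bond prediction concrete, the following is a minimal illustrative sketch, not the GTR-Mol-VLM implementation. It assumes a toy graph for benzoic acid drawn with a "Ph" abbreviation, a breadth-first visiting order, and a simple tuple-based step format; the abstract specifies none of these, and the names `atoms`, `bonds`, and `traverse` are hypothetical. The abbreviation is kept as a single node rather than expanded into six carbons, in the spirit of recognizing what is actually drawn.

```python
from collections import deque

# Hypothetical toy molecule: benzoic acid drawn as Ph-C(=O)-O.
# "Ph" stays a single abbreviated node (an assumption made for illustration).
atoms = {0: "Ph", 1: "C", 2: "O", 3: "O"}
bonds = {(0, 1): "single", (1, 2): "double", (1, 3): "single"}


def neighbors(idx):
    """Yield (neighbor_index, bond_type) pairs for the atom at idx."""
    for (a, b), bond_type in bonds.items():
        if a == idx:
            yield b, bond_type
        elif b == idx:
            yield a, bond_type


def traverse(start=0):
    """Walk the graph breadth-first, emitting one atom or bond per step."""
    steps = [("atom", start, atoms[start])]
    seen = {start}
    emitted_bonds = set()
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for nxt, bond_type in sorted(neighbors(cur)):
            key = (min(cur, nxt), max(cur, nxt))
            if key in emitted_bonds:
                continue  # this bond was already reported from its other end
            if nxt not in seen:
                seen.add(nxt)
                steps.append(("atom", nxt, atoms[nxt]))
                queue.append(nxt)
            steps.append(("bond", key, bond_type))
            emitted_bonds.add(key)
    return steps


for step in traverse():
    print(step)
```

Each emitted step depends only on the atoms and bonds already visited, which is the property that lets a traversal of this kind serve as an intermediate reasoning trace before a final string such as SMILES is produced.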

