

GutenOCR: A Grounded Vision-Language Front-End for Documents

January 20, 2026
Authors: Hunter Heidenreich, Ben Elliott, Olivia Dinica, Yosheb Getachew
cs.AI

Abstract

GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional "where is x?" queries. We introduce a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR as well as text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.
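To make the "unified, prompt-based interface" concrete, below is a minimal sketch of how one might query a single GutenOCR-style checkpoint for reading, detection, and conditional grounding. It assumes a Hugging Face Transformers setup with Qwen2.5-VL-style classes; the model ID, prompt wording, and output conventions are illustrative assumptions, not the paper's released interface.

```python
# Hypothetical usage sketch: one checkpoint, three kinds of queries
# (full-page reading, text detection, conditional "where is x?" grounding).
# The model ID and prompt strings are placeholders, not the official ones.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "gutenocr-7b"  # placeholder checkpoint name

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

def ask(image: Image.Image, prompt: str) -> str:
    """Send one image-plus-prompt query and return the decoded response."""
    messages = [{"role": "user",
                 "content": [{"type": "image"},
                             {"type": "text", "text": prompt}]}]
    text = processor.apply_chat_template(messages, tokenize=False,
                                         add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=1024)
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]

page = Image.open("page.png")
full_text = ask(page, "Read the full page.")                          # reading
detections = ask(page, "Detect all text lines with bounding boxes.")  # detection
location = ask(page, "Where is 'Total Amount Due'?")                  # grounding
```

Because all three capabilities live behind the same prompt interface, switching between page-level reading, region-level OCR, and grounding queries requires only a different instruction string rather than a different model or head.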