

GutenOCR: A Grounded Vision-Language Front-End for Documents

January 20, 2026
Authors: Hunter Heidenreich, Ben Elliott, Olivia Dinica, Yosheb Getachew
cs.AI

Abstract

GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional "where is x?" queries. We introduce a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR as well as text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.