
OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

January 29, 2026
Authors: Yufeng Zhong, Lei Chen, Xuanle Zhao, Wenkang Han, Liming Zheng, Jing Huang, Deyang Jiang, Yilin Cao, Lin Ma, Zhixiong Zeng
cs.AI

Abstract

The development of large vision-language models drives the demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, increasingly important. However, existing OCR methods primarily focus on recognizing text elements from images or scanned documents (text-centric OCR), neglecting the identification of visual elements from visually information-dense image sources (vision-centric OCR), such as charts, web pages, and scientific plots. In reality, these visually information-dense images are widespread on the internet and have significant real-world application value, such as data visualization and web page analysis. In this technical report, we propose OCRVerse, the first end-to-end holistic OCR method that unifies text-centric and vision-centric OCR. To this end, we construct a comprehensive data-engineering pipeline covering a wide range of text-centric documents, such as newspapers, magazines, and books, as well as vision-centric rendered composites, including charts, web pages, and scientific plots. Moreover, we propose a two-stage SFT-RL multi-domain training method for OCRVerse: SFT directly mixes cross-domain data during training to establish initial domain knowledge, while RL designs personalized reward strategies tailored to the characteristics of each domain. Specifically, since different domains require different output formats and expected results, the RL stage provides sufficient flexibility to customize reward signals for each domain, thereby improving cross-domain fusion and avoiding data conflicts. Experimental results demonstrate the effectiveness of OCRVerse, which achieves competitive results across text-centric and vision-centric data types, even comparable to large-scale open-source and closed-source models.
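The per-domain reward customization described for the RL stage can be sketched as a simple dispatch from domain label to reward function. This is a minimal illustrative sketch, not the paper's actual reward definitions: the domain names (`document`, `chart`) and the reward formulas (character-level similarity for text-centric outputs, key-value recovery rate for chart data) are assumptions chosen only to show the routing pattern.

```python
# Hypothetical sketch of per-domain reward routing for the RL stage.
# Domain names and reward formulas are illustrative assumptions.
from difflib import SequenceMatcher


def text_reward(pred: str, ref: str) -> float:
    # Text-centric OCR: score by character-level similarity to the reference text.
    return SequenceMatcher(None, pred, ref).ratio()


def chart_reward(pred: dict, ref: dict) -> float:
    # Vision-centric OCR (charts): score by fraction of reference
    # key-value pairs correctly recovered in the prediction.
    if not ref:
        return 0.0
    hits = sum(1 for k, v in ref.items() if pred.get(k) == v)
    return hits / len(ref)


# Each domain gets its own reward signal, so output formats can differ freely.
REWARDS = {
    "document": text_reward,  # newspapers, magazines, books
    "chart": chart_reward,    # charts and scientific plots
}


def domain_reward(domain: str, pred, ref) -> float:
    # Dispatch to the domain-specific reward during RL rollouts.
    return REWARDS[domain](pred, ref)
```

A separate reward per domain lets the RL objective judge, say, Markdown text output and structured chart data under different criteria without one format's scoring penalizing the other.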