OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models
January 29, 2026
Authors: Yufeng Zhong, Lei Chen, Xuanle Zhao, Wenkang Han, Liming Zheng, Jing Huang, Deyang Jiang, Yilin Cao, Lin Ma, Zhixiong Zeng
cs.AI
Abstract
The development of large vision-language models drives the demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, the subject of increasing attention. However, existing OCR methods primarily focus on recognizing text elements from images or scanned documents (text-centric OCR), neglecting the identification of visual elements from visually information-dense image sources (vision-centric OCR) such as charts, web pages, and scientific plots. In reality, these visually information-dense images are widespread on the internet and carry significant real-world application value in scenarios such as data visualization and web page analysis. In this technical report, we propose OCRVerse, the first holistic, end-to-end OCR method that unifies text-centric and vision-centric OCR. To this end, we construct a comprehensive data-engineering pipeline covering a wide range of text-centric documents, such as newspapers, magazines, and books, as well as vision-centric rendered composites, including charts, web pages, and scientific plots. Moreover, we propose a two-stage SFT-RL multi-domain training method for OCRVerse: the SFT stage directly mixes cross-domain data to establish initial domain knowledge, while the RL stage designs personalized reward strategies tailored to the characteristics of each domain. Specifically, since different domains require diverse output formats and expected results, the RL stage provides sufficient flexibility to customize reward signals for each domain, thereby improving cross-domain fusion and avoiding data conflicts. Experimental results demonstrate the effectiveness of OCRVerse, which achieves competitive results across text-centric and vision-centric data types, comparable even to large-scale open-source and closed-source models.
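To make the per-domain reward idea concrete, below is a minimal, hypothetical sketch of how an RL loop might dispatch flexible reward signals across domains with different output formats. The function names, metrics, and domain keys (`document`, `chart`, `webpage`) are illustrative assumptions, not the paper's actual implementation.

```python
import json
from difflib import SequenceMatcher


def text_centric_reward(prediction: str, reference: str) -> float:
    """Score a plain-text transcription by character-level similarity to
    the reference (a stand-in for an edit-distance-style OCR metric)."""
    return SequenceMatcher(None, prediction, reference).ratio()


def vision_centric_reward(prediction: str, reference: str) -> float:
    """Score structured output (e.g., chart data serialized as JSON) by
    the fraction of reference fields it reproduces exactly."""
    try:
        pred, ref = json.loads(prediction), json.loads(reference)
    except json.JSONDecodeError:
        return 0.0  # unparseable structured output earns no reward
    if not isinstance(pred, dict) or not isinstance(ref, dict):
        return 0.0  # this sketch assumes flat key-value outputs
    matched = sum(1 for key in ref if pred.get(key) == ref[key])
    return matched / max(len(ref), 1)


# Each domain registers its own scoring rule, so a single RL loop can
# reward heterogeneous output formats without forcing one shared metric.
DOMAIN_REWARDS = {
    "document": text_centric_reward,
    "chart": vision_centric_reward,
    "webpage": vision_centric_reward,
}


def reward(domain: str, prediction: str, reference: str) -> float:
    """Dispatch to the reward function registered for a sample's domain."""
    return DOMAIN_REWARDS[domain](prediction, reference)
```

For example, `reward("chart", '{"x": 1}', '{"x": 1, "y": 2}')` returns 0.5 (one of two reference fields recovered), while the same strings scored under the text-centric rule would receive partial string-similarity credit; keeping the two metrics separate is what lets cross-domain RL training avoid the reward conflicts described above.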