
Multimodal OCR: Parse Anything from Documents

March 13, 2026
Authors: Handong Zheng, Yumeng Li, Kaile Zhang, Liang Xin, Guangwei Zhao, Hao Liu, Jiayu Chen, Jie Lou, Jiyu Qiu, Qi Fu, Rui Yang, Shuo Jiang, Weijian Luo, Weijie Su, Weijun Zhang, Xingyu Zhu, Yabin Li, Yiwei Ma, Yu Chen, Zhaohui Yu, Guang Yang, Colin Zhang, Lei Zhang, Yuliang Liu, Xiang Bai
cs.AI

Abstract

We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method, termed dots.mocr, treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets, enabling systems to parse documents while preserving semantic relationships across elements. It offers several advantages: (1) it reconstructs both text and graphics as structured outputs, enabling more faithful document reconstruction; (2) it supports end-to-end training over heterogeneous document elements, allowing models to exploit semantic relations between textual and visual components; and (3) it converts previously discarded graphics into reusable code-level supervision, unlocking multimodal supervision embedded in existing documents. To make this paradigm practical at scale, we build a comprehensive data engine from PDFs, rendered webpages, and native SVG assets, and train a compact 3B-parameter model through staged pretraining and supervised fine-tuning. We evaluate dots.mocr from two perspectives: document parsing and structured graphics parsing. On document parsing benchmarks, it ranks second only to Gemini 3 Pro on our OCR Arena Elo leaderboard, surpasses existing open-source document parsing systems, and sets a new state of the art of 83.9 on olmOCR Bench. On structured graphics parsing, dots.mocr achieves higher reconstruction quality than Gemini 3 Pro across image-to-SVG benchmarks, demonstrating strong performance on charts, UI layouts, scientific figures, and chemical diagrams. These results show a scalable path toward building large-scale image-to-code corpora for multimodal pretraining. Code and models are publicly available at https://github.com/rednote-hilab/dots.mocr.
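The core idea of a "unified textual representation" is that graphics are serialized as code alongside the surrounding text rather than kept as cropped pixels. The following is a minimal sketch of what such an interleaved output could look like; the `Element` container, the `to_unified` helper, and the SVG-in-fenced-code convention are hypothetical illustrations, not the actual dots.mocr API or output format.

```python
from dataclasses import dataclass

@dataclass
class Element:
    kind: str     # "text", "table", or "graphic"
    content: str  # markdown source, or SVG code for a graphic

def to_unified(elements: list[Element]) -> str:
    """Serialize heterogeneous document elements into one textual
    stream, keeping graphics as code-level (SVG) representations."""
    parts = []
    for el in elements:
        if el.kind == "graphic":
            # Graphics become reusable code rather than raster crops.
            parts.append("```svg\n" + el.content + "\n```")
        else:
            parts.append(el.content)
    return "\n\n".join(parts)

# A toy page: a heading, a tiny bar chart as SVG, and a caption.
doc = [
    Element("text", "## Results"),
    Element("graphic",
            '<svg width="100" height="60">'
            '<rect width="40" height="60" fill="steelblue"/></svg>'),
    Element("text", "The bar chart above summarizes the scores."),
]
print(to_unified(doc))
```

Because every element, textual or graphical, lands in one token stream, a single sequence model can be trained end to end on the whole document, which is what lets the graphics double as code-level supervision.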