ChatPaper.ai

Multimodal OCR: Parse Anything from Documents

March 13, 2026
Authors: Handong Zheng, Yumeng Li, Kaile Zhang, Liang Xin, Guangwei Zhao, Hao Liu, Jiayu Chen, Jie Lou, Jiyu Qiu, Qi Fu, Rui Yang, Shuo Jiang, Weijian Luo, Weijie Su, Weijun Zhang, Xingyu Zhu, Yabin Li, Yiwei Ma, Yu Chen, Zhaohui Yu, Guang Yang, Colin Zhang, Lei Zhang, Yuliang Liu, Xiang Bai
cs.AI

Abstract

We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method, termed dots.mocr, treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets, enabling systems to parse documents while preserving semantic relationships across elements. It offers several advantages: (1) it reconstructs both text and graphics as structured outputs, enabling more faithful document reconstruction; (2) it supports end-to-end training over heterogeneous document elements, allowing models to exploit semantic relations between textual and visual components; and (3) it converts previously discarded graphics into reusable code-level supervision, unlocking multimodal supervision embedded in existing documents. To make this paradigm practical at scale, we build a comprehensive data engine from PDFs, rendered webpages, and native SVG assets, and train a compact 3B-parameter model through staged pretraining and supervised fine-tuning. We evaluate dots.mocr from two perspectives: document parsing and structured graphics parsing. On document parsing benchmarks, it ranks second only to Gemini 3 Pro on our OCR Arena Elo leaderboard, surpasses existing open-source document parsing systems, and sets a new state of the art of 83.9 on olmOCR Bench. On structured graphics parsing, dots.mocr achieves higher reconstruction quality than Gemini 3 Pro across image-to-SVG benchmarks, demonstrating strong performance on charts, UI layouts, scientific figures, and chemical diagrams. These results show a scalable path toward building large-scale image-to-code corpora for multimodal pretraining. Code and models are publicly available at https://github.com/rednote-hilab/dots.mocr.
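The unified textual representation described above can be illustrated with a small hypothetical example: a parsed page in which a table remains Markdown while a chart region is emitted as inline SVG code rather than left as cropped pixels. The snippet below is an illustrative sketch only; the document format and the `extract_graphics` helper are assumptions for demonstration, not the actual dots.mocr output schema or API. It shows how keeping graphics as code in the same text stream makes them recoverable as reusable, code-level supervision:

```python
import re

# Hypothetical MOCR-style output: the entire page, text and graphics alike,
# lives in one text stream. The format here is illustrative, not the real schema.
parsed_page = """\
# Quarterly Report

| Quarter | Revenue |
|---------|---------|
| Q1      | 120     |
| Q2      | 150     |

<svg width="100" height="60" xmlns="http://www.w3.org/2000/svg">
  <rect x="10" y="30" width="20" height="24" fill="steelblue"/>
  <rect x="50" y="20" width="20" height="34" fill="steelblue"/>
</svg>
"""

def extract_graphics(text: str) -> list[str]:
    """Pull code-level graphic elements (here: inline SVG blocks) out of
    a parsed document, so they can be reused as training supervision."""
    return re.findall(r"<svg.*?</svg>", text, flags=re.DOTALL)

graphics = extract_graphics(parsed_page)
print(len(graphics))  # the chart is recovered as editable SVG code
```

Because the chart survives as structured markup rather than pixels, it can be re-rendered, paired with its source image, and fed back into multimodal pretraining as an image-to-code example.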