FireRed-OCR Technical Report
March 2, 2026
Authors: Hao Wu, Haoran Lou, Xinyue Li, Zuodong Zhong, Zhaojun Sun, Phellon Chen, Xuanhe Zhou, Kai Zuo, Yibo Chen, Xu Tang, Yao Hu, Boxiang Zhou, Jian Wu, Yongji Wu, Wenxin Yu, Yingmiao Liu, Yuhao Huang, Manjie Xu, Gang Liu, Yidong Ma, Zhichao Sun, Changhao Qiao
cs.AI
Abstract
We present FireRed-OCR, a systematic framework for specializing general-purpose Vision-Language Models (VLMs) into high-performance OCR models. Large VLMs have demonstrated impressive general capabilities but frequently suffer from "structural hallucination" when processing complex documents, which limits their utility in industrial OCR applications. FireRed-OCR is designed to transform general-purpose VLMs (based on Qwen3-VL) into pixel-precise structural document parsing experts. To address the scarcity of high-quality structured data, we construct a "Geometry + Semantics" Data Factory. Instead of traditional random sampling, our pipeline uses geometric feature clustering and multi-dimensional tagging to synthesize and curate a highly balanced dataset that covers long-tail layouts and rare document types. Furthermore, we propose a Three-Stage Progressive Training strategy that guides the model from pixel-level perception to logical structure generation. This curriculum includes: (1) Multi-task Pre-alignment, which grounds the model's understanding of document structure; (2) Specialized SFT, which standardizes full-image Markdown output; and (3) Format-Constrained Group Relative Policy Optimization (GRPO), which uses reinforcement learning to enforce strict syntactic validity and structural integrity (e.g., table closure, formula syntax). Extensive evaluations on OmniDocBench v1.5 show that FireRed-OCR achieves state-of-the-art performance with an overall score of 92.94%, significantly outperforming strong baselines such as DeepSeek-OCR 2 and OCRVerse across text, formula, table, and reading order metrics. We open-source our code and model weights to facilitate the "General VLM to Specialized Structural Expert" paradigm.
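To make the third stage more concrete, the following is a minimal sketch of what a format-constrained reward and the group-relative advantage step of GRPO could look like. The abstract does not specify the actual reward design, so everything here is an assumption for illustration: the function names (`format_reward`, `group_relative_advantages`), the two checks (pipe-table row consistency and balanced math delimiters), and the equal 0.5 weighting are all hypothetical and not taken from the FireRed-OCR code.

```python
import re
from statistics import mean, pstdev


def table_blocks(markdown: str) -> list[list[str]]:
    """Group consecutive lines that start with '|' into table blocks."""
    blocks, current = [], []
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("|"):
            current.append(stripped)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def table_is_well_formed(rows: list[str]) -> bool:
    """Assumed check: every row is closed by '|' and has the same pipe count."""
    if any(not row.endswith("|") for row in rows):
        return False
    # Count only unescaped pipes so that '\|' inside a cell is ignored.
    widths = {len(re.findall(r"(?<!\\)\|", row)) for row in rows}
    return len(widths) == 1


def formulas_are_balanced(markdown: str) -> bool:
    """Assumed check: math delimiters and LaTeX environments are paired."""
    text = markdown.replace(r"\$", "")  # drop escaped dollar signs
    if text.count("$$") % 2 != 0:
        return False
    if text.replace("$$", "").count("$") % 2 != 0:
        return False
    begins = re.findall(r"\\begin\{([^}]*)\}", text)
    ends = re.findall(r"\\end\{([^}]*)\}", text)
    return sorted(begins) == sorted(ends)


def format_reward(markdown: str) -> float:
    """Hypothetical reward in [0, 1]: half for tables, half for formulas."""
    tables_ok = all(table_is_well_formed(b) for b in table_blocks(markdown))
    formulas_ok = formulas_are_balanced(markdown)
    return 0.5 * tables_ok + 0.5 * formulas_ok


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical completions sampled for the same page image.
    group = [
        "| a | b |\n|---|---|\n| 1 | 2 |\n\n$$x^2$$\n",  # valid table and formula
        "| a | b |\n|---|---|\n| 1 | 2\n\n$$x^2$\n",      # unclosed row and formula
        "| a | b |\n|---|---|\n| 1 | 2\n\n$$x^2$$\n",     # unclosed row only
        "| a | b |\n|---|---|\n| 1 | 2 |\n\n$$x^2$\n",    # unclosed formula only
    ]
    rewards = [format_reward(g) for g in group]
    print(rewards, group_relative_advantages(rewards))
```

In this sketch, completions whose structure is better than the group average receive positive advantages and structurally broken ones receive negative advantages, which is the signal a GRPO-style update would use. A production reward would likely combine such syntax checks with content accuracy terms.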