FireRed-OCR Technical Report

March 2, 2026
Authors: Hao Wu, Haoran Lou, Xinyue Li, Zuodong Zhong, Zhaojun Sun, Phellon Chen, Xuanhe Zhou, Kai Zuo, Yibo Chen, Xu Tang, Yao Hu, Boxiang Zhou, Jian Wu, Yongji Wu, Wenxin Yu, Yingmiao Liu, Yuhao Huang, Manjie Xu, Gang Liu, Yidong Ma, Zhichao Sun, Changhao Qiao
cs.AI

Abstract

We present FireRed-OCR, a systematic framework for specializing general-purpose Vision-Language Models (VLMs) into high-performance OCR models. Large VLMs have demonstrated impressive general capabilities but frequently suffer from "structural hallucination" when processing complex documents, which limits their utility in industrial OCR applications. FireRed-OCR transforms a general-purpose VLM (based on Qwen3-VL) into a pixel-precise structural document parsing expert. To address the scarcity of high-quality structured data, we construct a "Geometry + Semantics" Data Factory. Instead of traditional random sampling, our pipeline uses geometric feature clustering and multi-dimensional tagging to synthesize and curate a highly balanced dataset that covers long-tail layouts and rare document types. Furthermore, we propose a Three-Stage Progressive Training strategy that guides the model from pixel-level perception to logical structure generation. This curriculum includes: (1) Multi-task Pre-alignment to ground the model's understanding of document structure; (2) Specialized SFT to standardize full-image Markdown output; and (3) Format-Constrained Group Relative Policy Optimization (GRPO), which uses reinforcement learning to enforce strict syntactic validity and structural integrity (e.g., table closure, formula syntax). Extensive evaluations on OmniDocBench v1.5 demonstrate that FireRed-OCR achieves state-of-the-art performance with an overall score of 92.94%, significantly outperforming strong baselines such as DeepSeek-OCR 2 and OCRVerse across text, formula, table, and reading order metrics. We open-source our code and model weights to facilitate the "General VLM to Specialized Structural Expert" paradigm.
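
To make the format-constrained GRPO stage more concrete, here is a minimal Python sketch of how a format reward and group-relative advantages could fit together. It is not the authors' implementation: the abstract does not specify the reward, so the two checks (HTML-style `<table>` closure and balanced `$$` formula blocks), the equal weighting, the function names, and the use of a population standard deviation are all assumptions made for illustration.

```python
import re
from statistics import mean, pstdev


def tables_closed(markdown: str) -> bool:
    """True if every <table> tag is matched by a later </table> (assumes HTML-style tables)."""
    depth = 0
    for tag in re.findall(r"</?table\b[^>]*>", markdown, flags=re.IGNORECASE):
        depth += -1 if tag.startswith("</") else 1
        if depth < 0:  # a closing tag appeared before its opening tag
            return False
    return depth == 0


def formulas_balanced(markdown: str) -> bool:
    """Rough formula check: $$ delimiters are paired and braces balance inside each block."""
    parts = markdown.split("$$")
    if len(parts) % 2 == 0:  # odd number of $$ delimiters means an unclosed block
        return False
    for body in parts[1::2]:  # odd-indexed pieces are the formula bodies
        depth = 0
        for ch in body:
            depth += (ch == "{") - (ch == "}")
            if depth < 0:
                return False
        if depth != 0:
            return False
    return True


def format_reward(markdown: str) -> float:
    """Hypothetical reward: fraction of structural checks passed (equal weights assumed)."""
    checks = (tables_closed(markdown), formulas_balanced(markdown))
    return sum(checks) / len(checks)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical completions sampled for one document image.
samples = [
    "<table><tr><td>1</td></tr></table>",  # well-formed table
    "<table><tr><td>1</td></tr>",          # table never closed
    "$$\\frac{a}{b}$$",                    # well-formed formula
    "$$\\frac{a}{b$$",                     # unbalanced brace
]
rewards = [format_reward(s) for s in samples]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, well-formed outputs receive positive advantages and structurally broken ones receive negative advantages, which is the signal a GRPO-style update would use to discourage unclosed tables or malformed formulas. A real reward would likely combine such format checks with content accuracy terms.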