FireRed-OCR Technical Report

Abstract

We present FireRed-OCR, a systematic framework for specializing general-purpose vision-language models (VLMs) into high-performance OCR models. Large VLMs have demonstrated impressive general capabilities, but they frequently suffer from "structural hallucination" when processing complex documents, which limits their utility in industrial OCR applications. FireRed-OCR is designed to transform a general-purpose VLM (based on Qwen3-VL) into a pixel-precise expert in structural document parsing. To address the scarcity of high-quality structured data, we construct a "Geometry + Semantics" Data Factory. Unlike traditional random sampling, this pipeline uses geometric feature clustering and multi-dimensional tagging to synthesize and curate a highly balanced dataset that covers long-tail layouts and rare document types. We further propose a Three-Stage Progressive Training strategy that guides the model from pixel-level perception to logical structure generation. The curriculum consists of: (1) Multi-task Pre-alignment, which grounds the model's understanding of document structure; (2) Specialized SFT (supervised fine-tuning), which standardizes full-image Markdown output; and (3) Format-Constrained Group Relative Policy Optimization (GRPO), which uses reinforcement learning to enforce strict syntactic validity and structural integrity (e.g., table closure, formula syntax). Extensive evaluations on OmniDocBench v1.5 show that FireRed-OCR achieves state-of-the-art performance with an overall score of 92.94%, significantly outperforming strong baselines such as DeepSeek-OCR 2 and OCRVerse across text, formula, table, and reading order metrics. We open-source our code and model weights to facilitate the "General VLM to Specialized Structural Expert" paradigm.
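
To make the idea behind the format-constrained GRPO stage more concrete, the following is a minimal sketch of a rule-based format reward combined with group-relative advantage normalization. It is an illustration under stated assumptions, not the reward used by FireRed-OCR: the abstract names only table closure and formula syntax as examples, so the specific checks, the equal weighting, and the names `format_reward`, `braces_balanced`, and `group_relative_advantages` are all hypothetical.

```python
import re
from statistics import mean, pstdev


def braces_balanced(text: str) -> bool:
    """Return True if every '{' has a matching '}' (escaped braces are not special-cased)."""
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # a closing brace appeared before its opener
                return False
    return depth == 0


def format_reward(markdown: str) -> float:
    """Hypothetical reward in [0, 1]: the fraction of structural checks that pass."""
    checks = [
        # Table closure: each opening tag count must equal its closing tag count.
        all(
            len(re.findall(rf"<{tag}\b", markdown)) == len(re.findall(rf"</{tag}>", markdown))
            for tag in ("table", "tr", "td")
        ),
        # Formula syntax: display-math delimiters must come in pairs.
        markdown.count("$$") % 2 == 0,
        # Formula syntax: LaTeX grouping braces must balance.
        braces_balanced(markdown),
    ]
    return sum(checks) / len(checks)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one sampled group: (r - group mean) / (group std + eps).

    Population standard deviation is an assumption; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two hypothetical completions sampled for the same page image.
well_formed = "<table><tr><td>$$x^{2}$$</td></tr></table>"
malformed = "<table><tr><td>$$x^{2$$</td></tr>"

rewards = [format_reward(well_formed), format_reward(malformed)]
advantages = group_relative_advantages(rewards)
```

In this sketch the well-formed completion receives a positive advantage and the malformed one a negative advantage, so a policy-gradient update would push the model toward outputs that close their tables and formulas. When every completion in a group scores the same, all advantages are zero and the group contributes no format signal.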
