FireRed-OCR Technical Report

Abstract

Large Vision-Language Models (VLMs) have demonstrated impressive general capabilities, but they frequently suffer from "structural hallucination" when processing complex documents, which limits their utility in industrial OCR applications. We present FireRed-OCR, a systematic framework for specializing general-purpose VLMs into high-performance OCR models. Built on Qwen3-VL, it transforms a general VLM into a pixel-precise structural document parsing expert. To address the scarcity of high-quality structured data, we construct a "Geometry + Semantics" Data Factory. Unlike traditional random sampling, our pipeline uses geometric feature clustering and multi-dimensional tagging to synthesize and curate a balanced dataset that covers long-tail layouts and rare document types. Furthermore, we propose a Three-Stage Progressive Training strategy that guides the model from pixel-level perception to logical structure generation. The curriculum consists of: (1) Multi-task Pre-alignment, which grounds the model's understanding of document structure; (2) Specialized SFT, which standardizes full-image Markdown output; and (3) Format-Constrained Group Relative Policy Optimization (GRPO), which uses reinforcement learning to enforce strict syntactic validity and structural integrity (e.g., table closure, formula syntax). Extensive evaluations on OmniDocBench v1.5 show that FireRed-OCR achieves state-of-the-art performance with an overall score of 92.94%, significantly outperforming strong baselines such as DeepSeek-OCR 2 and OCRVerse across text, formula, table, and reading order metrics. We open-source our code and model weights to facilitate the "General VLM to Specialized Structural Expert" paradigm.
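To make the idea of a format-constrained reward concrete, the sketch below shows one way such a signal could be combined with the group-relative advantage used in GRPO. It is an illustration only, not the reward used in FireRed-OCR: the function names (format_reward, braces_balanced, group_relative_advantages), the three checks, and their equal weighting are all assumptions chosen for this example.

```python
import re
from statistics import mean, pstdev


def braces_balanced(text: str) -> bool:
    """Return True if every '{' has a matching '}' in the right order."""
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # a closing brace appeared before its opener
                return False
    return depth == 0


def format_reward(markdown: str) -> float:
    """Hypothetical structural reward in [0, 1]: fraction of checks passed.

    The checks are assumptions for illustration, not the paper's definition.
    """
    checks = [
        # Table closure: each HTML table tag is opened and closed equally often.
        all(
            len(re.findall(rf"<{tag}\b", markdown))
            == len(re.findall(rf"</{tag}>", markdown))
            for tag in ("table", "tr", "td")
        ),
        # Formula delimiters: display-math markers "$$" come in pairs.
        markdown.count("$$") % 2 == 0,
        # Formula syntax (crude proxy): braces are balanced.
        braces_balanced(markdown),
    ]
    return sum(checks) / len(checks)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one sampled group, as in GRPO.

    Each advantage is (reward - group mean) / (group std + eps). The
    population standard deviation is used here; that choice is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two candidate outputs sampled for the same (hypothetical) page image.
good = "<table><tr><td>$$x^{2}$$</td></tr></table>"
bad = "<table><tr><td>$$x^{2}</td></tr>"  # unclosed table and unpaired "$$"

rewards = [format_reward(good), format_reward(bad)]
advantages = group_relative_advantages(rewards)
```

In this sketch the well-formed candidate receives a positive advantage and the malformed one a negative advantage, so a policy update would favor outputs that close their tables and formulas. A practical reward would likely also score content accuracy, which is omitted here.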
