**Qianfan-OCR: A Unified End-to-End Model for Document Intelligence**

Abstract

We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks, including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by special think tokens. In this phase the model generates structured layout representations (bounding boxes, element types, and reading order) before producing its final output, which recovers layout grounding capabilities and improves accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA, and ChartQA against general VLMs of comparable scale, and attains the highest average score on public key information extraction benchmarks, surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B. The model is publicly accessible via the Baidu AI Cloud Qianfan platform.
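To make the Layout-as-Thought idea concrete, the following is a minimal sketch of how a client might separate an optional layout "thought" from the final Markdown in a model response. The abstract does not name the think tokens or specify how the layout is serialized, so everything here is an assumption for illustration: the `<think>`/`</think>` delimiters, the JSON list with `bbox`, `type`, and `order` fields, and the helper names `LayoutElement` and `split_layout_and_output` are all hypothetical and are not the model's actual output format.

```python
import json
import re
from dataclasses import dataclass

# Hypothetical delimiters: the abstract mentions "special think tokens"
# but does not name them, so these are placeholders.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


@dataclass
class LayoutElement:
    bbox: tuple  # assumed (x0, y0, x1, y1) in pixels
    kind: str    # element type, e.g. "title" or "table"
    order: int   # position in the reading order


def split_layout_and_output(raw: str):
    """Separate an optional layout 'thought' from the final output.

    Returns (elements, final_text). The element list is empty when no
    thinking phase is present, since the phase is optional.
    """
    pattern = re.escape(THINK_OPEN) + r"(.*?)" + re.escape(THINK_CLOSE)
    match = re.search(pattern, raw, flags=re.DOTALL)
    if match is None:
        return [], raw.strip()

    # Assumed serialization: a JSON list of {"bbox", "type", "order"} objects.
    items = json.loads(match.group(1))
    elements = [
        LayoutElement(tuple(item["bbox"]), item["type"], item["order"])
        for item in items
    ]
    # Present elements in reading order, whatever order they were emitted in.
    elements.sort(key=lambda e: e.order)
    final_text = raw[match.end():].strip()
    return elements, final_text
```

Under these assumptions, a caller would pass the raw decoded response to `split_layout_and_output` and use the returned elements for layout grounding (for example, drawing boxes over the page image) while treating the remaining text as the parsed document. Returning an empty list when no thought block is found keeps the same call site usable whether or not the thinking phase was triggered.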
