Qianfan-OCR: A Unified End-to-End Model for Document Intelligence
March 11, 2026
Authors: Daxiang Dong, Mingming Zheng, Dong Xu, Chunhua Luo, Bairong Zhuang, Yuxuan Li, Ruoyun He, Haoran Wang, Wenyu Zhang, Wenbo Wang, Yicheng Wang, Xue Xiong, Ayong Zheng, Xiaoying Zuo, Ziwei Ou, Jingnan Gu, Quanhao Guo, Jianmin Wu, Dawei Yin, Dou Shen
cs.AI
Abstract
We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by special think tokens that generates structured layout representations -- bounding boxes, element types, and reading order -- before producing final outputs, recovering layout grounding capabilities while improving accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA, and ChartQA against general VLMs of comparable scale, and attains the highest average score on public key information extraction benchmarks, surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B. The model is publicly accessible via the Baidu AI Cloud Qianfan platform.
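The abstract does not specify the concrete format of the think tokens or the layout representation. A minimal sketch of how a client might separate the optional Layout-as-Thought phase from the final Markdown output, assuming hypothetical <think>...</think> delimiters and a hypothetical JSON layout schema (neither is confirmed by the paper):

```python
import json
import re
from dataclasses import dataclass

# Assumed delimiters; the paper only says "special think tokens".
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

@dataclass
class LayoutElement:
    bbox: tuple          # (x0, y0, x1, y1); pixel coordinates assumed
    element_type: str    # e.g. "title", "paragraph", "table", "figure"
    reading_order: int   # position in the inferred reading sequence

def split_response(raw: str):
    """Separate the optional layout-thinking phase from the final Markdown."""
    m = THINK_RE.search(raw)
    if m is None:
        # The thinking phase is optional; the model may emit Markdown directly.
        return [], raw.strip()
    layout = [
        LayoutElement(tuple(e["bbox"]), e["type"], e["order"])
        for e in json.loads(m.group(1))
    ]
    markdown = raw[m.end():].strip()
    return sorted(layout, key=lambda e: e.reading_order), markdown

# Usage with a fabricated response string for illustration:
raw = (
    '<think>[{"bbox": [40, 32, 560, 80], "type": "title", "order": 0},'
    ' {"bbox": [40, 100, 560, 400], "type": "paragraph", "order": 1}]</think>'
    "# Quarterly Report\n\nRevenue grew 12% year over year."
)
layout, md = split_response(raw)
for el in layout:
    print(el.reading_order, el.element_type, el.bbox)
print(md)
```

Keeping the layout plan in a machine-readable prelude, rather than interleaved with the output, lets downstream consumers recover bounding boxes and reading order for grounding while plain image-to-Markdown users can simply discard the thinking span.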