Qianfan-OCR: A Unified End-to-End Model for Document Intelligence
March 11, 2026
Authors: Daxiang Dong, Mingming Zheng, Dong Xu, Chunhua Luo, Bairong Zhuang, Yuxuan Li, Ruoyun He, Haoran Wang, Wenyu Zhang, Wenbo Wang, Yicheng Wang, Xue Xiong, Ayong Zheng, Xiaoying Zuo, Ziwei Ou, Jingnan Gu, Quanhao Guo, Jianmin Wu, Dawei Yin, Dou Shen
cs.AI
Abstract
We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by special think tokens that generates structured layout representations -- bounding boxes, element types, and reading order -- before producing final outputs, recovering layout grounding capabilities while improving accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA, and ChartQA against general VLMs of comparable scale, and attains the highest average score on public key information extraction benchmarks, surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B. The model is publicly accessible via the Baidu AI Cloud Qianfan platform.
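The abstract describes Layout-as-Thought only at a high level. As a rough sketch of the idea, the snippet below shows how a client might separate an optional layout-thinking block (bounding boxes, element types, reading order) from the final Markdown output; the <think> delimiter tokens, the JSON element schema, and the helper function are illustrative assumptions, not the model's documented interface.

```python
# Hypothetical illustration of the Layout-as-Thought output format described in
# the abstract: an optional thinking phase, delimited by special think tokens,
# emits structured layout elements (bounding box, element type, reading order)
# before the final Markdown. The token names and element schema below are
# assumptions for illustration, not the model's actual vocabulary.
import json
import re

THINK_OPEN, THINK_CLOSE = "<think>", "</think>"  # assumed delimiter tokens


def split_layout_and_markdown(model_output: str):
    """Separate the optional layout-thinking block from the final Markdown."""
    match = re.search(
        re.escape(THINK_OPEN) + r"(.*?)" + re.escape(THINK_CLOSE),
        model_output,
        flags=re.DOTALL,
    )
    if match is None:
        # Thinking phase was not triggered; the whole output is Markdown.
        return [], model_output.strip()
    layout = json.loads(match.group(1))            # structured layout elements
    markdown = model_output[match.end():].strip()  # final image-to-Markdown result
    return layout, markdown


# Example response in the assumed format.
response = (
    "<think>"
    '[{"order": 0, "type": "title",     "bbox": [62, 40, 880, 96]},'
    ' {"order": 1, "type": "paragraph", "bbox": [62, 120, 880, 310]},'
    ' {"order": 2, "type": "table",     "bbox": [62, 340, 880, 620]}]'
    "</think>\n"
    "# Quarterly Report\n\nRevenue grew 12% ...\n\n"
    "| Q | Revenue |\n|---|---------|\n| 1 | 120 |"
)

layout, markdown = split_layout_and_markdown(response)
for element in layout:
    print(element["order"], element["type"], element["bbox"])
print(markdown)
```

In this reading, the layout block recovers the grounding information (where each element sits and in what order it should be read) that a purely end-to-end Markdown output would otherwise discard, while the downstream consumer can ignore it when only the Markdown is needed.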