Qianfan-OCR: A Unified End-to-End Model for Document Intelligence

Abstract

We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It converts images directly to Markdown and supports a range of prompt-driven tasks, including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by special think tokens. In this phase the model generates a structured layout representation (bounding boxes, element types, and reading order) before producing its final output, which recovers layout grounding capabilities and improves accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA, and ChartQA against general VLMs of comparable scale, and attains the highest average score on public key information extraction benchmarks, surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B. The model is publicly accessible via the Baidu AI Cloud Qianfan platform.
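
To make the Layout-as-Thought idea concrete, the following is a minimal sketch of how a client might separate an optional layout "thought" from the final Markdown. Everything about the wire format here is an assumption for illustration: the abstract does not name the think tokens or specify the layout schema, so the `<think>` delimiters, the JSON fields `type` and `bbox`, and the names `LayoutElement` and `split_output` are hypothetical.

```python
import json
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LayoutElement:
    order: int                        # position in reading order
    kind: str                         # element type, e.g. "title" or "table"
    bbox: Tuple[int, int, int, int]   # (x0, y0, x1, y1), assumed pixel coordinates


# Hypothetical delimiters: the real think tokens are not given in the abstract.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_output(raw: str) -> Tuple[List[LayoutElement], str]:
    """Split raw model text into (layout elements, final Markdown).

    If no thinking phase is present, the layout list is empty and the
    whole text is treated as the final output.
    """
    match = THINK_RE.search(raw)
    if match is None:
        return [], raw.strip()
    # Assumed schema: a JSON list already sorted in reading order.
    items = json.loads(match.group(1))
    layout = [
        LayoutElement(order=i, kind=item["type"], bbox=tuple(item["bbox"]))
        for i, item in enumerate(items)
    ]
    return layout, raw[match.end():].strip()


# Made-up sample text, not actual model output.
sample = (
    '<think>[{"type": "title", "bbox": [40, 30, 560, 80]}, '
    '{"type": "table", "bbox": [40, 100, 560, 400]}]</think>\n'
    "# Invoice\n\n| Item | Price |\n|---|---|\n| Pen | 2 |"
)
layout, markdown = split_output(sample)
```

Because the thinking phase is optional, the sketch falls back to returning the text unchanged when no layout block is found, so the same caller can handle both modes.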

