Qianfan-OCR: A Unified End-to-End Model for Document Intelligence

Abstract

We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding in a single architecture. It converts images directly to Markdown and supports a range of prompt-driven tasks, including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase, triggered by special think tokens, that generates structured layout representations (bounding boxes, element types, and reading order) before producing the final output. This recovers layout grounding capabilities and improves accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA, and ChartQA against general VLMs of comparable scale, and attains the highest average score on public key information extraction benchmarks, surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B. The model is publicly accessible via the Baidu AI Cloud Qianfan platform.

