Multimodal OCR: Parse Anything from Documents

Abstract

We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method, termed dots.mocr, treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets, enabling systems to parse documents while preserving semantic relationships across elements. It offers several advantages: (1) it reconstructs both text and graphics as structured outputs, enabling more faithful document reconstruction; (2) it supports end-to-end training over heterogeneous document elements, allowing models to exploit semantic relations between textual and visual components; and (3) it converts previously discarded graphics into reusable code-level supervision, unlocking multimodal supervision embedded in existing documents. To make this paradigm practical at scale, we build a comprehensive data engine from PDFs, rendered webpages, and native SVG assets, and train a compact 3B-parameter model through staged pretraining and supervised fine-tuning. We evaluate dots.mocr from two perspectives: document parsing and structured graphics parsing. On document parsing benchmarks, it ranks second only to Gemini 3 Pro on our OCR Arena Elo leaderboard, surpasses existing open-source document parsing systems, and sets a new state of the art of 83.9 on olmOCR Bench. On structured graphics parsing, dots.mocr achieves higher reconstruction quality than Gemini 3 Pro across image-to-SVG benchmarks, demonstrating strong performance on charts, UI layouts, scientific figures, and chemical diagrams. These results show a scalable path toward building large-scale image-to-code corpora for multimodal pretraining. Code and models are publicly available at https://github.com/rednote-hilab/dots.mocr.
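To make the idea of a "unified textual representation" concrete, the following is a minimal sketch, not the dots.mocr output format. The `Element` schema, the `GRAPHIC_KINDS` set, the bracketed kind tags, and the top-to-bottom reading-order rule are all assumptions introduced for illustration. It only shows how text regions and graphics regions could share one output sequence, with graphics carried as SVG code rather than as cropped pixels.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Element:
    """One parsed page region (hypothetical schema, not the dots.mocr format)."""
    kind: str                        # e.g. "text", "table", "chart", "icon"
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels
    content: str                     # plain text for text regions, SVG code for graphics


# Region kinds whose content is expected to be SVG code (an assumption).
GRAPHIC_KINDS = {"chart", "diagram", "icon"}


def serialize(elements: List[Element]) -> str:
    """Emit all elements as a single text sequence."""
    # Crude reading order: top-to-bottom, then left-to-right (an assumption).
    ordered = sorted(elements, key=lambda e: (e.bbox[1], e.bbox[0]))
    # Text and graphics go through the same path; only the content differs.
    parts = [f"[{e.kind}]\n{e.content}" for e in ordered]
    return "\n\n".join(parts)


def graphics_as_code(elements: List[Element]) -> List[str]:
    """Collect the SVG strings that could serve as image-to-code targets."""
    return [e.content for e in elements if e.kind in GRAPHIC_KINDS]


# Toy page: a heading above a one-bar "chart" expressed as SVG.
page = [
    Element("chart", (0, 200, 100, 300),
            '<svg xmlns="http://www.w3.org/2000/svg">'
            '<rect width="10" height="20"/></svg>'),
    Element("text", (0, 0, 100, 50), "Quarterly results"),
]

output = serialize(page)
print(output)
```

In this sketch the chart is not dropped from the output. Its SVG string sits in the same sequence as the heading, and `graphics_as_code` returns it separately as the kind of code-level target the abstract describes.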
