
MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding

March 23, 2026
Authors: Hejun Dong, Junbo Niu, Bin Wang, Weijun Zeng, Wentao Zhang, Conghui He
cs.AI

Abstract

Optical character recognition (OCR) has evolved from line-level transcription to structured document parsing, requiring models to recover long-form sequences containing layout, tables, and formulas. Despite recent advances in vision-language models, most existing systems rely on autoregressive decoding, which introduces sequential latency and amplifies error propagation in long documents. In this work, we revisit document OCR from an inverse rendering perspective, arguing that left-to-right causal generation is an artifact of serialization rather than an intrinsic property of the task. Motivated by this insight, we propose MinerU-Diffusion, a unified diffusion-based framework that replaces autoregressive sequential decoding with parallel diffusion denoising under visual conditioning. MinerU-Diffusion employs a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy to enable stable training and efficient long-sequence inference. Extensive experiments demonstrate that MinerU-Diffusion consistently improves robustness while achieving up to 3.2x faster decoding compared to autoregressive baselines. Evaluations on the proposed Semantic Shuffle benchmark further confirm its reduced dependence on linguistic priors and stronger visual OCR capability.
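The latency argument above can be illustrated with a toy step count: an autoregressive decoder needs one forward pass per output token, while a block-wise diffusion decoder runs a fixed denoising schedule over each block of tokens in parallel. This is a minimal sketch under assumed values; the block size, number of denoising steps, and helper names are illustrative, not the paper's actual configuration.

```python
def autoregressive_steps(seq_len: int) -> int:
    # One sequential forward pass per token: latency grows linearly
    # with the length of the output sequence.
    return seq_len


def blockwise_diffusion_steps(seq_len: int, block_size: int,
                              denoise_steps: int) -> int:
    # All tokens inside a block are denoised in parallel, so only the
    # number of blocks times the fixed denoising schedule contributes
    # to sequential latency.
    n_blocks = -(-seq_len // block_size)  # ceiling division
    return n_blocks * denoise_steps


# Illustrative values (assumed, not from the paper):
seq_len = 2048   # long structured document output
block = 256      # tokens per block
steps = 16       # denoising iterations per block

ar = autoregressive_steps(seq_len)                        # 2048 passes
diff = blockwise_diffusion_steps(seq_len, block, steps)   # 8 * 16 = 128
```

In this toy accounting the diffusion decoder's sequential cost depends on the denoising schedule rather than the token count, which is why parallel denoising can outpace autoregressive decoding on long documents; the real-world speedup reported in the abstract (up to 3.2x) also reflects per-step compute, which the sketch ignores.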