MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding
March 23, 2026
Authors: Hejun Dong, Junbo Niu, Bin Wang, Weijun Zeng, Wentao Zhang, Conghui He
cs.AI
Abstract
Optical character recognition (OCR) has evolved from line-level transcription to structured document parsing, requiring models to recover long-form sequences containing layout, tables, and formulas. Despite recent advances in vision-language models, most existing systems rely on autoregressive decoding, which introduces sequential latency and amplifies error propagation in long documents. In this work, we revisit document OCR from an inverse rendering perspective, arguing that left-to-right causal generation is an artifact of serialization rather than an intrinsic property of the task. Motivated by this insight, we propose MinerU-Diffusion, a unified diffusion-based framework that replaces autoregressive sequential decoding with parallel diffusion denoising under visual conditioning. MinerU-Diffusion employs a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy to enable stable training and efficient long-sequence inference. Extensive experiments demonstrate that MinerU-Diffusion consistently improves robustness while achieving up to 3.2x faster decoding compared to autoregressive baselines. Evaluations on the proposed Semantic Shuffle benchmark further confirm its reduced dependence on linguistic priors and stronger visual OCR capability.
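The abstract describes replacing left-to-right autoregressive decoding with block-wise parallel diffusion denoising. As a rough intuition for that idea, the sketch below implements a generic mask-denoising loop in the style of confidence-scheduled parallel decoding: all positions in a block start masked, and at each step the most confident predictions are committed in parallel while the rest are re-predicted conditioned on the partially filled block. All names (`denoise_block`, `predict`, the commit schedule, the `MASK` sentinel) are illustrative assumptions for exposition, not the paper's actual decoder or API.

```python
# Illustrative sketch of block-wise parallel mask-denoising decoding.
# This is NOT MinerU-Diffusion's implementation; names and the
# confidence schedule are assumptions for exposition only.

MASK = -1  # sentinel for a not-yet-decoded token position


def denoise_block(block, predict, steps=4):
    """Iteratively fill MASK positions in `block`, committing the most
    confident predictions first, a fixed fraction per denoising step.

    `predict(block, i)` returns a (token, confidence) proposal for
    position i, conditioned on the current partially filled block.
    """
    block = list(block)
    for step in range(steps):
        masked = [i for i, t in enumerate(block) if t == MASK]
        if not masked:
            break
        # Re-predict every remaining masked position in parallel.
        proposals = {i: predict(block, i) for i in masked}
        # Commit roughly 1/(remaining steps) of positions, most confident first.
        k = max(1, len(masked) // (steps - step))
        for i in sorted(masked, key=lambda i: -proposals[i][1])[:k]:
            block[i] = proposals[i][0]
    # Commit anything still masked after the step budget is spent.
    for i, t in enumerate(block):
        if t == MASK:
            block[i] = predict(block, i)[0]
    return block
```

Because each step fills several positions at once, the number of model invocations grows with the step budget rather than the sequence length, which is the source of the latency advantage over token-by-token autoregressive decoding; the per-step confidence ranking is one common heuristic for deciding which positions are safe to commit early.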