DODO: Discrete OCR Diffusion Models
February 18, 2026
Authors: Sean Man, Roy Ganz, Roi Ronen, Shahar Tsiper, Shai Mazor, Niv Nayman
cs.AI
Abstract
Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLMs) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents because it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task in which the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential: they introduce structural instabilities that are benign in flexible tasks, such as captioning, but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 3x faster inference compared to autoregressive baselines.
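The decoding scheme the abstract describes can be illustrated with a minimal sketch: blocks are committed left-to-right (autoregressive across blocks), while tokens inside a block are unmasked in parallel over a few confidence-ranked refinement steps. The `toy_model` denoiser, the block size, and the unmasking schedule below are all illustrative assumptions, not the paper's actual architecture; a real DODO-style system would call a VLM conditioned on the image.

```python
import math

MASK = "<mask>"

def toy_model(context, block):
    # Hypothetical stand-in for a VLM denoiser: returns a
    # (token, confidence) pair for each position in `block`.
    # For this demo it "reads" characters off a fixed target string,
    # mimicking OCR's deterministic input-to-output mapping.
    target = "HELLO WORLD, THIS IS OCR."
    start = len(context)  # committed prefix length = block start offset
    preds = []
    for i in range(len(block)):
        ch = target[start + i]
        # Confidence decays with distance into the block, mimicking
        # harder predictions far from the committed context.
        preds.append((ch, 1.0 / (1 + i)))
    return preds

def block_diffusion_decode(seq_len, block_size, steps_per_block):
    """Block-wise masked-diffusion decoding sketch: sequential over
    blocks, parallel (iterative unmasking) within each block."""
    output = []
    for start in range(0, seq_len, block_size):
        size = min(block_size, seq_len - start)
        block = [MASK] * size
        for _ in range(steps_per_block):
            masked = [i for i, t in enumerate(block) if t == MASK]
            if not masked:
                break
            preds = toy_model("".join(output), block)
            # Commit the most confident half of still-masked positions
            # (confidence-based unmasking, as in masked diffusion samplers).
            k = max(1, math.ceil(len(masked) / 2))
            ranked = sorted(masked, key=lambda i: -preds[i][1])
            for i in ranked[:k]:
                block[i] = preds[i][0]
        # Greedily fill any positions still masked after the step budget.
        preds = toy_model("".join(output), block)
        for i, t in enumerate(block):
            if t == MASK:
                block[i] = preds[i][0]
        output.extend(block)
    return "".join(output)
```

Because each block needs only `steps_per_block` forward passes instead of one pass per token, the number of model calls scales with the number of blocks rather than the sequence length, which is the source of the speedup the abstract reports.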