

Efficient Document Parsing via Parallel Token Prediction

March 16, 2026
作者: Lei Li, Ze Zhao, Meng Li, Zhongwang Lun, Yi Yuan, Xingjing Lu, Zheng Wei, Jiang Bian, Zang Li
cs.AI

Abstract

Document parsing, a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing speed. In this paper, we propose Parallel-Token Prediction (PTP), a pluggable, model-agnostic, and simple yet effective method that enables VLMs to generate multiple future tokens in parallel with improved sample efficiency. Specifically, we insert learnable tokens into the input sequence and design corresponding training objectives to equip the model with parallel decoding capabilities for document parsing. Furthermore, to support effective training, we develop a comprehensive data generation pipeline that efficiently produces large-scale, high-quality document parsing training data for VLMs. Extensive experiments on OmniDocBench and olmOCR-bench demonstrate that our method not only significantly improves decoding speed (1.6x to 2.2x) but also reduces model hallucinations and exhibits strong generalization abilities.
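To make the speedup mechanism concrete, the following is a minimal toy sketch, not the paper's implementation: it contrasts standard one-token-per-pass AR decoding with PTP-style decoding, where k placeholder slots let one forward pass emit k future tokens. The "model" here is a deterministic stand-in, and names such as `PTP_K` and `forward` are illustrative assumptions.

```python
PTP_K = 4  # assumed number of tokens emitted per forward pass

def forward(context, num_outputs):
    """Toy stand-in for one model forward pass: deterministically
    derive `num_outputs` next tokens from the context. A trained PTP
    model would fill its learnable placeholder tokens analogously."""
    out, s = [], sum(context)
    for _ in range(num_outputs):
        t = (s * 7 + 3) % 100  # arbitrary deterministic recurrence
        out.append(t)
        s += t
    return out

def decode_ar(prompt, n):
    """Standard autoregressive decoding: one token per forward pass."""
    seq, passes = list(prompt), 0
    while len(seq) - len(prompt) < n:
        seq += forward(seq, 1)
        passes += 1
    return seq[len(prompt):], passes

def decode_ptp(prompt, n, k=PTP_K):
    """PTP-style decoding: up to k tokens per forward pass."""
    seq, passes = list(prompt), 0
    while len(seq) - len(prompt) < n:
        remaining = n - (len(seq) - len(prompt))
        seq += forward(seq, min(k, remaining))
        passes += 1
    return seq[len(prompt):], passes

ar_tokens, ar_passes = decode_ar([5, 9], 8)
ptp_tokens, ptp_passes = decode_ptp([5, 9], 8)
# Identical output, but PTP needs n/k forward passes instead of n.
print(ar_passes, ptp_passes, ar_tokens == ptp_tokens)
```

Because the toy recurrence is deterministic, both decoders produce the same 8 tokens, while PTP uses 2 forward passes instead of 8; in the actual method, matching AR quality at fewer passes is what the learnable tokens and training objectives are designed to achieve.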
PDF · March 18, 2026