

Efficient Document Parsing via Parallel Token Prediction

March 16, 2026
Authors: Lei Li, Ze Zhao, Meng Li, Zhongwang Lun, Yi Yuan, Xingjing Lu, Zheng Wei, Jiang Bian, Zang Li
cs.AI

Abstract

Document parsing, a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing speed. In this paper, we propose Parallel-Token Prediction (PTP), a pluggable, model-agnostic, and simple yet effective method that enables VLMs to generate multiple future tokens in parallel with improved sample efficiency. Specifically, we insert learnable tokens into the input sequence and design corresponding training objectives to equip the model with parallel decoding capabilities for document parsing. Furthermore, to support effective training, we develop a comprehensive data generation pipeline that efficiently produces large-scale, high-quality document parsing training data for VLMs. Extensive experiments on OmniDocBench and olmOCR-bench demonstrate that our method not only significantly improves decoding speed (1.6x-2.2x) but also reduces model hallucinations and exhibits strong generalization abilities.
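The core idea, decoding K future tokens per forward pass by appending K learnable placeholder tokens to the context, can be sketched as follows. This is a minimal toy sketch only: the abstract does not specify PTP's architecture or training objective, so the `forward` stub, the value of `K`, and all names here are illustrative assumptions, not the paper's implementation.

```python
K = 4  # assumed number of tokens predicted per forward pass

def forward(context, num_parallel):
    """Stub for a VLM forward pass. In PTP-style decoding, the real model
    would consume `num_parallel` learnable tokens appended to `context`
    and emit one prediction per learnable slot, all in a single pass.
    Toy stand-in: predict successive integers after the last token."""
    last = context[-1]
    return [last + i + 1 for i in range(num_parallel)]

def parallel_decode(prompt, total_tokens, k=K):
    """Generate `total_tokens` continuation tokens, k per model call,
    instead of one per call as in standard autoregressive decoding."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < total_tokens:
        out.extend(forward(out, k))  # one call yields k tokens
        steps += 1
    return out[len(prompt):len(prompt) + total_tokens], steps

tokens, steps = parallel_decode([0], 8)
# 8 tokens produced in 2 forward passes rather than 8 AR steps
```

With k=4 the number of sequential model calls drops by roughly 4x; the 1.6x-2.2x wall-clock speedup reported in the abstract is smaller, as expected, since per-pass cost and output quality constraints offset part of the step reduction.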