Efficient Document Parsing via Parallel Token Prediction

Abstract

Document parsing, a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck that severely limits parsing speed. In this paper, we propose Parallel-Token Prediction (PTP), a pluggable, model-agnostic, and simple yet effective method that enables VLMs to generate multiple future tokens in parallel with improved sample efficiency. Specifically, we insert learnable tokens into the input sequence and design corresponding training objectives to equip the model with parallel decoding capabilities for document parsing. Furthermore, to support effective training, we develop a comprehensive data generation pipeline that efficiently produces large-scale, high-quality document parsing training data for VLMs. Extensive experiments on OmniDocBench and olmOCR-bench demonstrate that our method not only significantly improves decoding speed (1.6x to 2.2x) but also reduces model hallucinations and exhibits strong generalization abilities.
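As a rough illustration of the idea of inserting learnable tokens and pairing them with a training objective, the sketch below builds one training example in which k placeholder tokens are appended to a prefix and each placeholder is assigned a future token as its label. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `build_parallel_example`, the placement of the placeholders at the end of the prefix, and the use of -100 as an ignored label value are all hypothetical choices made for illustration.

```python
IGNORE = -100  # assumed label value that the loss would skip (hypothetical choice)


def build_parallel_example(tokens, prefix_len, k, placeholder_id):
    """Build (inputs, labels) for one hypothetical parallel-prediction example.

    The last prefix token is labeled with the next token, as in ordinary
    autoregressive training. Each of the k appended placeholder tokens is
    labeled with one further future token, so a single forward pass is
    supervised on up to k + 1 future tokens.
    """
    if not 0 < prefix_len <= len(tokens):
        raise ValueError("prefix_len must be between 1 and len(tokens)")
    if k < 0:
        raise ValueError("k must be non-negative")

    prefix = tokens[:prefix_len]
    # Up to k + 1 future tokens; fewer if the sequence ends early.
    future = tokens[prefix_len:prefix_len + k + 1]

    inputs = prefix + [placeholder_id] * k
    # Earlier prefix positions are not supervised in this toy setup.
    labels = [IGNORE] * (prefix_len - 1) + future
    # Pad with IGNORE when the sequence ran out of future tokens.
    labels += [IGNORE] * (len(inputs) - len(labels))
    return inputs, labels


if __name__ == "__main__":
    example_inputs, example_labels = build_parallel_example(
        tokens=[11, 12, 13, 14, 15, 16], prefix_len=3, k=2, placeholder_id=0
    )
    print(example_inputs)
    print(example_labels)
```

In this toy layout, the position of the last prefix token and the two placeholder positions together carry three supervised targets, which is the sense in which one forward pass could yield several future tokens. How the actual method positions its learnable tokens, masks attention, or weights the objectives is not specified in the abstract.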
