Efficient Document Parsing via Parallel Token Prediction

Abstract

Document parsing, a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck that severely limits parsing speed. In this paper, we propose Parallel-Token Prediction (PTP), a pluggable, model-agnostic, and simple yet effective method that enables VLMs to generate multiple future tokens in parallel with improved sample efficiency. Specifically, we insert learnable tokens into the input sequence and design corresponding training objectives to equip the model with parallel decoding capabilities for document parsing. Furthermore, to support effective training, we develop a comprehensive data generation pipeline that efficiently produces large-scale, high-quality document parsing training data for VLMs. Extensive experiments on OmniDocBench and olmOCR-bench demonstrate that our method not only significantly improves decoding speed (1.6x-2.2x) but also reduces model hallucinations and exhibits strong generalization abilities.
