VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

Abstract

In recent years, notable advancements have been made in the domain of visual document understanding, with the prevailing architecture comprising a cascade of vision and language models. The text component can either be extracted explicitly with the use of external OCR models in OCR-based approaches, or alternatively, the vision model can be endowed with reading capabilities in OCR-free approaches. Typically, the queries to the model are input exclusively to the language component, necessitating the visual features to encompass the entire document. In this paper, we present VisFocus, an OCR-free method designed to better exploit the vision encoder's capacity by coupling it directly with the language prompt. To do so, we replace the down-sampling layers with layers that receive the input prompt and allow highlighting relevant parts of the document, while disregarding others. We pair the architecture enhancements with a novel pre-training task, using language masking on a snippet of the document text fed to the visual encoder in place of the prompt, to empower the model with focusing capabilities. Consequently, VisFocus learns to allocate its attention to text patches pertinent to the provided prompt. Our experiments demonstrate that this prompt-guided visual encoding approach significantly improves performance, achieving state-of-the-art results on various benchmarks.
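The abstract does not say how the prompt enters the replaced down-sampling layers. The following is a minimal sketch of one possible reading, not the authors' implementation. It assumes the prompt is already embedded as token vectors with the same width as the patch features, and it assumes cross-attention as the fusion step. The function names `merge_2x2`, `attend_to_prompt`, and `prompt_guided_downsample` are hypothetical, and all learned projections are omitted for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_2x2(grid):
    """Down-sample a patch grid by averaging each non-overlapping 2x2 block.

    Assumption: averaging stands in for a learned patch-merging layer.
    `grid` is a list of rows; each cell is a feature vector (list of floats).
    Height and width are assumed to be even.
    """
    height, width = len(grid), len(grid[0])
    dim = len(grid[0][0])
    merged = []
    for r in range(0, height, 2):
        row = []
        for c in range(0, width, 2):
            block = [grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1]]
            row.append([sum(vec[i] for vec in block) / 4.0 for i in range(dim)])
        merged.append(row)
    return merged


def attend_to_prompt(patch, prompt_tokens):
    """Cross-attention with the patch as query and prompt tokens as keys and values.

    Assumption: scaled dot-product attention with a residual connection.
    """
    dim = len(patch)
    scores = [
        sum(p * t for p, t in zip(patch, token)) / math.sqrt(dim)
        for token in prompt_tokens
    ]
    weights = softmax(scores)
    context = [
        sum(w * token[i] for w, token in zip(weights, prompt_tokens))
        for i in range(dim)
    ]
    # Residual: keep the visual feature and add the prompt-conditioned context.
    return [p + c for p, c in zip(patch, context)]


def prompt_guided_downsample(grid, prompt_tokens):
    """Halve the spatial resolution, then condition every merged patch on the prompt."""
    return [
        [attend_to_prompt(patch, prompt_tokens) for patch in row]
        for row in merge_2x2(grid)
    ]


if __name__ == "__main__":
    # Toy 4x4 grid of 2-dimensional patch features and a 2-token prompt.
    grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
    prompt = [[1.0, 0.0], [0.0, 1.0]]
    out = prompt_guided_downsample(grid, prompt)
    print(len(out), len(out[0]), len(out[0][0]))
```

In this sketch the prompt influences the features at the point where spatial resolution is reduced, so patches that align with the prompt tokens receive a larger prompt-conditioned contribution before later stages see them. A real encoder would use learned query, key, and value projections and multiple heads.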
