VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

Abstract

Visual document understanding has advanced notably in recent years, with the prevailing architecture being a cascade of vision and language models. The text can either be extracted explicitly by an external OCR model, as in OCR-based approaches, or the vision model itself can be given reading capabilities, as in OCR-free approaches. Typically, the query is fed only to the language component, so the visual features must encode the entire document. In this paper, we present VisFocus, an OCR-free method designed to better exploit the vision encoder's capacity by coupling it directly with the language prompt. To do so, we replace the down-sampling layers with layers that receive the input prompt and can highlight relevant parts of the document while disregarding others. We pair these architectural enhancements with a novel pre-training task that applies language masking to a snippet of the document text, fed to the visual encoder in place of the prompt, to give the model focusing capabilities. Consequently, VisFocus learns to allocate its attention to text patches pertinent to the provided prompt. Our experiments demonstrate that this prompt-guided visual encoding approach significantly improves performance, achieving state-of-the-art results on various benchmarks.
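To make the idea of a prompt-aware down-sampling layer more concrete, the following is a minimal sketch under stated assumptions. It is not the authors' layer. It assumes a 1D sequence of patch tokens, replaces learned patch merging with simple averaging of adjacent pairs, and uses single-head cross-attention with no learned projections. All function names (`merge_pairs`, `cross_attend`, `prompt_guided_downsample`) are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_pairs(tokens):
    """Down-sample by averaging adjacent token pairs.

    Assumption: a stand-in for learned patch merging. A trailing
    unpaired token (odd-length input) is dropped.
    """
    return [
        [(a + b) / 2.0 for a, b in zip(tokens[i], tokens[i + 1])]
        for i in range(0, len(tokens) - 1, 2)
    ]


def cross_attend(visual, prompt):
    """Single-head cross-attention: visual tokens query the prompt tokens.

    Assumption: identity projections for queries, keys, and values.
    Each visual token receives a residual update built from the prompt
    tokens it is most similar to.
    """
    d = len(prompt[0])
    out = []
    for q in visual:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in prompt
        ]
        weights = softmax(scores)
        context = [
            sum(w * k[j] for w, k in zip(weights, prompt))
            for j in range(d)
        ]
        out.append([qi + ci for qi, ci in zip(q, context)])
    return out


def prompt_guided_downsample(patches, prompt):
    """Merge patch tokens, then condition the result on the prompt."""
    return cross_attend(merge_pairs(patches), prompt)


# Toy usage: four 2-dimensional patch tokens and a one-token prompt.
patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
prompt = [[1.0, 0.0]]
focused = prompt_guided_downsample(patches, prompt)
```

In this toy setup the output has half as many tokens as the input, and every output token carries prompt information, which is the property the abstract attributes to the replaced down-sampling layers. A real model would use learned projections, multiple heads, and 2D patch merging.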

