VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding
July 17, 2024
Authors: Ofir Abramovich, Niv Nayman, Sharon Fogel, Inbal Lavi, Ron Litman, Shahar Tsiper, Royee Tichauer, Srikar Appalaraju, Shai Mazor, R. Manmatha
cs.AI
Abstract
In recent years, notable advancements have been made in the domain of visual
document understanding, with the prevailing architecture comprising a cascade
of vision and language models. The text component can either be extracted
explicitly with the use of external OCR models in OCR-based approaches, or
alternatively, the vision model can be endowed with reading capabilities in
OCR-free approaches. Typically, the queries to the model are input exclusively
to the language component, necessitating the visual features to encompass the
entire document. In this paper, we present VisFocus, an OCR-free method
designed to better exploit the vision encoder's capacity by coupling it
directly with the language prompt. To do so, we replace the down-sampling
layers with layers that receive the input prompt and allow highlighting
relevant parts of the document, while disregarding others. We pair the
architecture enhancements with a novel pre-training task, using language
masking on a snippet of the document text fed to the visual encoder in place of
the prompt, to empower the model with focusing capabilities. Consequently,
VisFocus learns to allocate its attention to text patches pertinent to the
provided prompt. Our experiments demonstrate that this prompt-guided visual
encoding approach significantly improves performance, achieving
state-of-the-art results on various benchmarks.
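
To make the architectural idea concrete, here is a minimal sketch of a down-sampling step that also receives the prompt. It is an illustration under stated assumptions, not the paper's actual layer: the function names (`merge_2x2`, `cross_attend`, `prompt_guided_downsample`) are hypothetical, 2x2 averaging stands in for whatever down-sampling the encoder uses, and the cross-attention has no learned projections. Patch and prompt-token embeddings are assumed to share one dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_2x2(grid):
    """Average each non-overlapping 2x2 block of patch vectors.

    A trailing odd row or column is dropped (assumption made for brevity).
    """
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    merged = []
    for i in range(0, h - 1, 2):
        row = []
        for j in range(0, w - 1, 2):
            block = [grid[i][j], grid[i][j + 1], grid[i + 1][j], grid[i + 1][j + 1]]
            row.append([sum(v[k] for v in block) / 4.0 for k in range(d)])
        merged.append(row)
    return merged


def cross_attend(patch, prompt_tokens):
    """Attend from one patch (query) to the prompt tokens (keys and values).

    Returns the patch plus its prompt context (residual), and the weights.
    """
    d = len(patch)
    scores = [
        sum(p * t for p, t in zip(patch, tok)) / math.sqrt(d)
        for tok in prompt_tokens
    ]
    weights = softmax(scores)
    context = [
        sum(wt * tok[k] for wt, tok in zip(weights, prompt_tokens))
        for k in range(d)
    ]
    return [p + c for p, c in zip(patch, context)], weights


def prompt_guided_downsample(grid, prompt_tokens):
    """Down-sample the patch grid, then inject prompt information per patch."""
    merged = merge_2x2(grid)
    return [[cross_attend(p, prompt_tokens)[0] for p in row] for row in merged]


# Toy usage: a 2x2 grid of 2-d patch features and two prompt-token embeddings.
grid = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]]
prompt = [[1.0, 0.0], [0.0, 1.0]]
out = prompt_guided_downsample(grid, prompt)
```

In this toy setting, patches that align with a prompt token give that token more attention weight, so the down-sampled features carry prompt-dependent information instead of a prompt-agnostic summary of the whole page.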