Unifying Multimodal Retrieval via Document Screenshot Embedding

Abstract

In the real world, documents are organized in different formats and varied modalities. Traditional retrieval pipelines require tailored document parsing techniques and content extraction modules to prepare input for indexing. This process is tedious, error-prone, and loses information. To address this, we propose Document Screenshot Embedding (DSE), a novel retrieval paradigm that treats document screenshots as a unified input format. It requires no content extraction preprocessing and preserves all the information in a document (e.g., text, images, and layout). DSE leverages a large vision-language model to directly encode document screenshots into dense representations for retrieval. To evaluate our method, we first construct Wiki-SS, a corpus of 1.3M Wikipedia web page screenshots, to answer the questions from the Natural Questions dataset. In this text-intensive document retrieval setting, DSE shows competitive effectiveness compared to other text retrieval methods that rely on parsing. For example, DSE outperforms BM25 by 17 points in top-1 retrieval accuracy. Additionally, in a mixed-modality slide retrieval task, DSE significantly outperforms OCR text retrieval methods by over 15 points in nDCG@10. These experiments show that DSE is an effective document retrieval paradigm for diverse types of documents. Model checkpoints, code, and the Wiki-SS collection will be released.
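The retrieval and evaluation steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the vision-language encoder is not shown, the vectors below are toy values standing in for screenshot and query embeddings, and all function names (`cosine`, `rank`, `top1_accuracy`, `ndcg_at_k`) are hypothetical. Cosine similarity is assumed as the scoring function, and nDCG is computed with linear gain.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(query_vec, doc_vecs):
    """Return document ids sorted by decreasing similarity to the query."""
    scores = {doc_id: cosine(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

def top1_accuracy(rankings, relevant):
    """Fraction of queries whose top-ranked document is relevant."""
    hits = sum(1 for qid, ranked in rankings.items()
               if ranked and ranked[0] in relevant[qid])
    return hits / len(rankings)

def ndcg_at_k(ranked, gains, k=10):
    """nDCG@k with linear gain; `gains` maps doc id to a relevance grade."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

# Toy embeddings (assumed values, standing in for encoder outputs).
doc_vecs = {
    "page_a": [1.0, 0.0, 0.0],
    "page_b": [0.0, 1.0, 0.0],
    "page_c": [0.7, 0.7, 0.0],
}
query_vec = [0.9, 0.1, 0.0]

ranked = rank(query_vec, doc_vecs)
print(ranked)
print(top1_accuracy({"q1": ranked}, {"q1": {"page_a"}}))
print(ndcg_at_k(ranked, {"page_c": 1}))
```

In a real system the corpus embeddings would be computed once and stored in an index, so that only the query needs to be encoded at search time.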
