Unifying Multimodal Retrieval via Document Screenshot Embedding
June 17, 2024
Authors: Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, Jimmy Lin
cs.AI
Abstract
In the real world, documents are organized in different formats and varied
modalities. Traditional retrieval pipelines require tailored document parsing
techniques and content extraction modules to prepare input for indexing. This
process is tedious, error-prone, and loses information. To this end, we
propose Document Screenshot Embedding (DSE), a novel retrieval paradigm that
treats document screenshots as a unified input format. DSE requires no
content-extraction preprocessing and preserves all the information in a
document (e.g., text, images, and layout). DSE leverages a large vision-language
model to directly encode document screenshots into dense representations for
retrieval. To evaluate our method, we first construct Wiki-SS, a corpus of
1.3M Wikipedia web page screenshots, to answer the questions from
the Natural Questions dataset. In this text-intensive document retrieval
setting, DSE shows competitive effectiveness compared to other text retrieval
methods that rely on parsing. For example, DSE outperforms BM25 by 17 points in
top-1 retrieval accuracy. Additionally, in a mixed-modality slide retrieval
task, DSE significantly outperforms OCR text retrieval methods by over 15
points in nDCG@10. These experiments show that DSE is an effective document
retrieval paradigm for diverse types of documents. Model checkpoints, code, and
the Wiki-SS collection will be released.
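
The retrieval and evaluation steps mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy two-dimensional vectors stand in for embeddings that a vision-language model would produce from screenshots, the function names (`cosine`, `rank`, `top1_accuracy`, `ndcg_at_k`) are hypothetical, cosine similarity is an assumed scoring function, and the metric definitions are the common textbook forms (linear-gain nDCG), which may differ in detail from those used in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(query_vec, doc_vecs):
    """Return document ids sorted by decreasing similarity to the query."""
    scores = {doc_id: cosine(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

def top1_accuracy(rankings, relevant):
    """Fraction of queries whose top-ranked document is relevant."""
    hits = sum(
        1 for qid, ranked in rankings.items()
        if ranked and ranked[0] in relevant[qid]
    )
    return hits / len(rankings)

def ndcg_at_k(ranked, gains, k=10):
    """nDCG@k with linear gains; `gains` maps doc id -> graded relevance."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

# Toy stand-ins for screenshot embeddings (assumed values, not real model output).
doc_vecs = {"page_a": [1.0, 0.0], "page_b": [0.6, 0.8], "page_c": [0.0, 1.0]}
query_vec = [0.9, 0.1]

ranked = rank(query_vec, doc_vecs)
accuracy = top1_accuracy({"q1": ranked}, {"q1": {"page_a"}})
score = ndcg_at_k(ranked, {"page_a": 1})
```

In a real system the document vectors would be computed once and stored in an index, so that only the query needs to be encoded at search time.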