

ModernVBERT: Towards Smaller Visual Document Retrievers

October 1, 2025
Authors: Paul Teiletche, Quentin Macé, Max Conti, Antonio Loison, Gautier Viaud, Pierre Colombo, Manuel Faysse
cs.AI

Abstract

Multimodal embedding models are gaining prevalence, notably for document retrieval as efficient alternatives to text-only pipelines. These models are typically built by finetuning large vision-language decoders (VLMs) with contrastive losses on text-image pairs. In this work, we show that, while cost-efficient, this repurposing approach often bottlenecks retrieval performance. Through controlled experiments, we establish a principled recipe for improving visual document retrieval models. We notably measure the impact of attention masking, image resolution, modality alignment data regimes, and late interaction centered contrastive objectives which emerge as central performance factors. Building on these insights, we release ModernVBERT, a compact 250M-parameter vision-language encoder that outperforms models up to 10 times larger when finetuned on document retrieval tasks. Models and code are made available at https://huggingface.co/ModernVBERT.
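The abstract highlights late-interaction-centered contrastive objectives as a central performance factor. As background, a minimal sketch of the standard late-interaction (ColBERT-style) MaxSim scoring rule is shown below: each query token embedding is matched against its most similar document (e.g. image-patch) embedding, and the maxima are summed. The function name and toy data are illustrative assumptions, not taken from the paper or its released code.

```python
import numpy as np

def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """Late-interaction MaxSim score (illustrative sketch).

    For each query token embedding, take the maximum cosine
    similarity over all document embeddings, then sum over
    query tokens. Inputs: (n_tokens, dim) arrays.
    """
    # Normalize rows so dot products become cosine similarities.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sim = q @ d.T                       # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())  # best match per query token

# Toy example: 3 query tokens vs. 5 document patch embeddings, dim 8.
rng = np.random.default_rng(0)
score = maxsim_score(rng.normal(size=(3, 8)), rng.normal(size=(5, 8)))
```

In a contrastive training setup, such scores for matching and non-matching text-image pairs would feed a softmax-based loss; the sketch above covers only the scoring step.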