M3DR: Towards Universal Multilingual Multimodal Document Retrieval

December 3, 2025
Authors: Adithya S Kolavi, Vyoman Jain
cs.AI

Abstract

Multimodal document retrieval systems have shown strong progress in aligning visual and textual content for semantic search. However, most existing approaches remain heavily English-centric, which limits their effectiveness in multilingual contexts. In this work, we present M3DR (Multilingual Multimodal Document Retrieval), a framework designed to bridge this language gap and remain applicable across diverse linguistic and cultural contexts. M3DR leverages synthetic multilingual document data and generalizes across different vision-language architectures and model sizes, enabling robust cross-lingual and cross-modal alignment. Using contrastive training, our models learn unified representations for text and document images that transfer effectively across languages. We validate this capability on 22 typologically diverse languages, demonstrating consistent performance and adaptability across linguistic and script variations. We further introduce a comprehensive benchmark that captures real-world multilingual scenarios, evaluating models under monolingual, multilingual, and mixed-language settings. M3DR supports both single dense-vector and ColBERT-style token-level multi-vector retrieval paradigms. Our models, NetraEmbed and ColNetraEmbed, achieve state-of-the-art performance with roughly 150% relative improvement on cross-lingual retrieval.
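
The two retrieval paradigms named in the abstract differ in how a query is scored against a document. The sketch below is illustrative only and is not taken from the paper: it assumes cosine similarity for the single dense-vector case and the standard ColBERT-style MaxSim sum for the multi-vector case. The function names (`dense_score`, `maxsim_score`) and the toy 2-dimensional embeddings are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dense_score(query_vec, doc_vec):
    """Single dense-vector scoring: cosine similarity of one query
    embedding against one document embedding (assumed metric)."""
    return dot(l2_normalize(query_vec), l2_normalize(doc_vec))


def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style late interaction: for every query token embedding,
    take its best cosine match among the document token embeddings,
    then sum those maxima over the query tokens."""
    q = [l2_normalize(t) for t in query_tokens]
    d = [l2_normalize(t) for t in doc_tokens]
    return sum(max(dot(qt, dt) for dt in d) for qt in q)


# Toy data (hypothetical): two query tokens, two candidate documents.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # matches both query tokens
doc_b_tokens = [[1.0, 0.0], [-1.0, 0.0]]             # matches only the first

score_a = maxsim_score(query_tokens, doc_a_tokens)
score_b = maxsim_score(query_tokens, doc_b_tokens)
ranking = sorted(
    [("doc_a", score_a), ("doc_b", score_b)],
    key=lambda pair: pair[1],
    reverse=True,
)
```

In this toy setup `doc_a` ranks above `doc_b` because every query token finds a close match among its token embeddings. The dense-vector variant stores one vector per document and is cheaper to index, while the multi-vector variant keeps one embedding per token and trades storage for finer-grained matching.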