MLAIRE: A Multilingual Language-Aware Information Retrieval Evaluation Protocol

Abstract

Multilingual information retrieval is increasingly important in real-world search settings, where users issue queries over mixed-language corpora. Existing evaluations mainly reward language-agnostic semantic relevance, treating relevant passages equally regardless of language. Yet retrieval utility also depends on the language of the retrieved passages: users may prefer results they can read and verify in the query language, and query-passage language mismatch can complicate downstream grounding and answer verification in Retrieval-Augmented Generation systems. To evaluate this language-aware dimension, we introduce MLAIRE, a Multilingual Language-Aware Information Retrieval Evaluation protocol that disentangles cross-lingual semantic retrieval from query-language preference. MLAIRE constructs controlled pools with parallel passages across languages, enabling measurement of semantic retrieval accuracy and query-language preference when equivalent translations are available. We propose language-aware metrics, including Language Preference Rate (LPR) and Lang-nDCG, together with a four-way decomposition separating semantic and query-language preference failures. Evaluating 31 dense, sparse, and late-interaction retrievers, we show that standard metrics obscure distinct behaviors: semantically strong retrievers may return correct content in a non-query language, while retrievers with stronger query-language preference may retrieve less semantically relevant passages.
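
To make the distinction between semantic failures and query-language preference failures concrete, the following is a minimal illustrative sketch, not the protocol's actual definitions, which the abstract does not give. It assumes a top-1 setting in which each retrieved passage is labeled with its language and with whether it is a translation of the gold passage. The `Hit` record, the four outcome names, and the reading of LPR as "the share of semantically correct hits that are in the query language" are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Top-1 result for one query (hypothetical record layout)."""
    query_lang: str    # language of the query
    passage_lang: str  # language of the retrieved passage
    relevant: bool     # True if the passage is a translation of the gold passage


def four_way(hits):
    """Count the outcomes of (semantic correctness) x (query-language match)."""
    counts = Counter()
    for h in hits:
        same_lang = h.passage_lang == h.query_lang
        if h.relevant and same_lang:
            counts["correct_query_lang"] += 1   # no failure
        elif h.relevant:
            counts["correct_other_lang"] += 1   # language-preference failure only
        elif same_lang:
            counts["wrong_query_lang"] += 1     # semantic failure only
        else:
            counts["wrong_other_lang"] += 1     # both failures
    return counts


def language_preference_rate(hits):
    """Assumed LPR: among semantically correct hits, the share in the query language."""
    relevant = [h for h in hits if h.relevant]
    if not relevant:
        return 0.0
    return sum(h.passage_lang == h.query_lang for h in relevant) / len(relevant)


# Toy data, invented purely for illustration.
hits = [
    Hit("ko", "ko", True),
    Hit("ko", "en", True),
    Hit("de", "de", False),
    Hit("de", "en", False),
    Hit("fr", "fr", True),
]
counts = four_way(hits)
lpr = language_preference_rate(hits)
```

On this toy data, a language-agnostic accuracy would count three of five queries as correct, while the breakdown shows that one of those three was returned in a non-query language. This is the kind of behavior the abstract says standard metrics obscure.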