MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocol

Abstract

Multilingual information retrieval is increasingly important in real-world search settings, where users issue queries over mixed-language corpora. Existing evaluations mainly reward language-agnostic semantic relevance, treating relevant passages equally regardless of language. Yet retrieval utility also depends on the language of the retrieved passages: users may prefer results they can read and verify in the query language, and a language mismatch between query and passage can complicate downstream grounding and answer verification in Retrieval-Augmented Generation systems. To evaluate this language-aware dimension, we introduce MLAIRE, a Multilingual Language-Aware Information Retrieval Evaluation protocol that disentangles cross-lingual semantic retrieval from query-language preference. MLAIRE constructs controlled pools with parallel passages across languages, so that semantic retrieval accuracy and query-language preference can be measured separately when equivalent translations are available. We propose language-aware metrics, including Language Preference Rate (LPR) and Lang-nDCG, together with a four-way decomposition that separates semantic retrieval failures from query-language preference failures. Evaluating 31 dense, sparse, and late-interaction retrievers, we show that standard metrics obscure distinct behaviors: semantically strong retrievers may return correct content in a non-query language, while retrievers with stronger query-language preference may retrieve less semantically relevant passages.
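
The abstract names a four-way decomposition and a Language Preference Rate but does not give their formulas, so the following is only a minimal sketch of one plausible reading, not the authors' definitions. It assumes that every passage in a controlled pool carries a `doc_id` shared by all of its parallel translations plus a `lang` code, and that only the top-1 result per query is scored. The names `Passage`, `classify_top1`, `decompose`, and `language_preference_rate` are hypothetical, and the conditional form of the rate (query-language share among semantically correct results) is an assumption.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical labels for the four outcomes; the paper may name them differently.
CATEGORIES = (
    "correct_content_query_lang",
    "correct_content_other_lang",
    "wrong_content_query_lang",
    "wrong_content_other_lang",
)


@dataclass(frozen=True)
class Passage:
    doc_id: str  # shared by all parallel translations of the same passage
    lang: str    # language code of this particular translation


def classify_top1(query_lang: str, gold_doc_id: str, top1: Passage) -> str:
    """Assign one top-1 result to one of the four outcome cells."""
    semantic_ok = top1.doc_id == gold_doc_id
    language_ok = top1.lang == query_lang
    if semantic_ok and language_ok:
        return "correct_content_query_lang"
    if semantic_ok:
        return "correct_content_other_lang"
    if language_ok:
        return "wrong_content_query_lang"
    return "wrong_content_other_lang"


def decompose(results: list[tuple[str, str, Passage]]) -> dict[str, float]:
    """Return the share of queries in each cell.

    Each result is (query_lang, gold_doc_id, top1_passage).
    """
    if not results:
        raise ValueError("results must not be empty")
    counts = Counter(classify_top1(q, g, p) for q, g, p in results)
    return {name: counts.get(name, 0) / len(results) for name in CATEGORIES}


def language_preference_rate(results: list[tuple[str, str, Passage]]) -> float:
    """Assumed definition: among semantically correct top-1 results,
    the fraction returned in the query language."""
    matches = [p.lang == q for q, g, p in results if p.doc_id == g]
    return sum(matches) / len(matches) if matches else float("nan")
```

Under these assumptions, a retriever that always finds the right passage but returns its English translation for Chinese queries would place all of its mass in `correct_content_other_lang` and score a preference rate of 0, which is the kind of behavior the abstract says a language-agnostic metric would not reveal.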