Improving Semantic Proximity in Information Retrieval through Cross-Lingual Alignment

Abstract

As multilingual documents become more accessible and widely used, Cross-Lingual Information Retrieval (CLIR) has emerged as an important research area. Conventionally, CLIR tasks are conducted in settings where the language of the documents differs from that of the queries, and the documents are typically written in a single language. In this paper, we point out that such settings may not adequately evaluate cross-lingual alignment capability. Specifically, we observe that when English documents coexist with documents in another language in the same pool, most multilingual retrievers tend to rank unrelated English documents above the relevant document written in the same language as the query. To rigorously analyze and quantify this phenomenon, we introduce several scenarios and metrics designed to evaluate the cross-lingual alignment performance of multilingual retrieval models. Furthermore, to improve cross-lingual performance under these challenging conditions, we propose a novel training strategy aimed at strengthening cross-lingual alignment. Using only a small dataset of 2.8k samples, our method significantly improves cross-lingual retrieval performance while mitigating the English inclination problem. Extensive analyses demonstrate that the proposed method substantially enhances the cross-lingual alignment capabilities of most multilingual embedding models.
