On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability

Abstract

Decoder-only large language models (LLMs) are increasingly replacing BERT-style architectures as the backbone for dense retrieval, achieving substantial performance gains and broad adoption. However, the robustness of these LLM-based retrievers remains underexplored. In this paper, we present the first systematic study of the robustness of state-of-the-art open-source LLM-based dense retrievers from two complementary perspectives: generalizability and stability. For generalizability, we evaluate retrieval effectiveness across four benchmarks spanning 30 datasets, using linear mixed-effects models to estimate marginal mean performance and disentangle intrinsic model capability from dataset heterogeneity. Our analysis reveals that while instruction-tuned models generally excel, those optimized for complex reasoning often suffer a "specialization tax," exhibiting limited generalizability in broader contexts. For stability, we assess model resilience against both unintentional query variations (e.g., paraphrasing, typos) and malicious adversarial attacks (e.g., corpus poisoning). We find that LLM-based retrievers are more robust to typos and corpus poisoning than encoder-only baselines, yet remain vulnerable to semantic perturbations such as synonym substitution. Further analysis shows that embedding geometry (e.g., angular uniformity) provides predictive signals for lexical stability and suggests that scaling model size generally improves robustness. These findings inform future robustness-aware retriever design and principled benchmarking. Our code is publicly available at https://github.com/liyongkang123/Robust_LLM_Retriever_Eval.
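
To make the typo-based stability evaluation concrete, the following is a minimal sketch of how a query might be perturbed and how the resulting performance change might be summarized. It is not taken from the released code: the helper names `swap_typo` and `relative_drop`, the adjacent-character swap as the typo model, and the relative-drop summary are all assumptions chosen for illustration.

```python
import random


def swap_typo(query: str, rng: random.Random) -> str:
    """Return `query` with two adjacent inner letters swapped in one word.

    Hypothetical typo model (an assumption, not the paper's procedure).
    Words shorter than four characters are never perturbed; if no word
    qualifies, the query is returned unchanged. Whitespace is normalized
    to single spaces. If the two swapped letters happen to be identical,
    the output equals the input.
    """
    words = query.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return query
    idx = rng.choice(candidates)
    word = words[idx]
    # Keep the first and last characters fixed so the typo stays readable.
    pos = rng.randrange(1, len(word) - 2)
    words[idx] = word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]
    return " ".join(words)


def relative_drop(clean_score: float, perturbed_score: float) -> float:
    """Fraction of the clean score lost under perturbation (0.0 = no loss)."""
    if clean_score == 0:
        return 0.0
    return (clean_score - perturbed_score) / clean_score


# Illustrative usage with made-up scores; a seeded RNG keeps runs repeatable.
rng = random.Random(0)
print(swap_typo("dense retrieval robustness", rng))
print(relative_drop(0.50, 0.40))
```

In practice, the clean and perturbed scores would come from running the same retriever on the original and perturbed query sets with an effectiveness metric of choice; the numbers above are placeholders, not results.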
