
On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability

April 17, 2026
作者: Yongkang Li, Panagiotis Eustratiadis, Yixing Fan, Evangelos Kanoulas
cs.AI

Abstract

Decoder-only large language models (LLMs) are increasingly replacing BERT-style architectures as the backbone for dense retrieval, achieving substantial performance gains and broad adoption. However, the robustness of these LLM-based retrievers remains underexplored. In this paper, we present the first systematic study of the robustness of state-of-the-art open-source LLM-based dense retrievers from two complementary perspectives: generalizability and stability. For generalizability, we evaluate retrieval effectiveness across four benchmarks spanning 30 datasets, using linear mixed-effects models to estimate marginal mean performance and disentangle intrinsic model capability from dataset heterogeneity. Our analysis reveals that while instruction-tuned models generally excel, those optimized for complex reasoning often suffer a "specialization tax," exhibiting limited generalizability in broader contexts. For stability, we assess model resilience against both unintentional query variations (e.g., paraphrasing, typos) and malicious adversarial attacks (e.g., corpus poisoning). We find that LLM-based retrievers show improved robustness against typos and corpus poisoning compared to encoder-only baselines, yet remain vulnerable to semantic perturbations like synonymizing. Further analysis shows that embedding geometry (e.g., angular uniformity) provides predictive signals for lexical stability and suggests that scaling model size generally improves robustness. These findings inform future robustness-aware retriever design and principled benchmarking. Our code is publicly available at https://github.com/liyongkang123/Robust_LLM_Retriever_Eval.
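The abstract describes fitting a linear mixed-effects model to separate intrinsic model capability from dataset heterogeneity. A minimal sketch of that idea using `statsmodels`, with synthetic scores (the data, model names, and score values here are illustrative, not the paper's results): model identity enters as a fixed effect and dataset as a random intercept, so the fixed-effect coefficients approximate marginal mean differences between retrievers.

```python
# Sketch: marginal mean performance via a linear mixed-effects model.
# Fixed effect: retriever identity; random intercept: dataset.
# All data below are synthetic placeholders for illustration.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical nDCG@10 scores for two retrievers on three datasets.
data = pd.DataFrame({
    "model":   ["A", "A", "A", "B", "B", "B"],
    "dataset": ["d1", "d2", "d3", "d1", "d2", "d3"],
    "score":   [0.42, 0.55, 0.38, 0.47, 0.60, 0.41],
})

# The per-dataset random intercept absorbs dataset heterogeneity, so the
# "model" coefficient reflects the retrievers' average performance gap.
fit = smf.mixedlm("score ~ model", data, groups=data["dataset"]).fit()
print(fit.params["model[T.B]"])  # estimated A-to-B performance gap
```

Because the toy design is balanced, the fixed-effect estimate coincides with the simple difference of per-model means; with the unbalanced benchmark coverage the paper studies, the two would diverge, which is the point of using the mixed model.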
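The abstract also points to angular uniformity of the embedding space as a predictive signal for lexical stability. One common way to quantify this, assumed here for illustration, is the uniformity measure of Wang and Isola (2020) on L2-normalized embeddings; the function name and the temperature `t = 2.0` are our own illustrative choices, not the paper's protocol.

```python
# Sketch: angular uniformity of an embedding space, measured as
# log E[exp(-t * ||u - v||^2)] over distinct pairs of unit-normalized
# embeddings. Lower (more negative) values = more uniform spread on
# the hypersphere; values near 0 = embeddings collapsed into a cone.
import numpy as np

def angular_uniformity(emb: np.ndarray, t: float = 2.0) -> float:
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # project to sphere
    sq_dists = np.sum((emb[:, None, :] - emb[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(emb), k=1)  # distinct pairs only
    return float(np.log(np.mean(np.exp(-t * sq_dists[iu]))))

rng = np.random.default_rng(0)
spread = rng.normal(size=(256, 64))   # directions roughly uniform on sphere
collapsed = spread + 20.0             # shifted: all directions nearly equal
print(angular_uniformity(spread), angular_uniformity(collapsed))
```

Under this measure, the well-spread Gaussian cloud scores strictly lower than the collapsed one, matching the intuition that a more uniform angular geometry leaves more room to keep perturbed queries separable.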
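Finally, the stability analysis probes unintentional query variations such as typos. A minimal example of one such perturbation, an adjacent-character swap; this is an illustrative perturbation we chose for the sketch, not the paper's exact query-variation protocol.

```python
# Sketch: a typo-style query perturbation via a random adjacent-character
# swap. Retrieval stability can then be measured by comparing rankings
# for the original vs. perturbed query.
import random

def swap_typo(query: str, seed: int = 0) -> str:
    rng = random.Random(seed)  # seeded for reproducible perturbations
    chars = list(query)
    if len(chars) < 2:
        return query
    i = rng.randrange(len(chars) - 1)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]  # swap neighbors
    return "".join(chars)

print(swap_typo("dense retrieval"))
```

A swap preserves the multiset of characters, so lexical overlap with the corpus barely changes, which is one plausible reason the abstract finds typo robustness easier to achieve than robustness to semantic perturbations like synonym substitution.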
PDF: April 22, 2026