ChatPaper.ai


On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability

April 17, 2026
作者: Yongkang Li, Panagiotis Eustratiadis, Yixing Fan, Evangelos Kanoulas
cs.AI

Abstract

Decoder-only large language models (LLMs) are increasingly replacing BERT-style architectures as the backbone for dense retrieval, achieving substantial performance gains and broad adoption. However, the robustness of these LLM-based retrievers remains underexplored. In this paper, we present the first systematic study of the robustness of state-of-the-art open-source LLM-based dense retrievers from two complementary perspectives: generalizability and stability. For generalizability, we evaluate retrieval effectiveness across four benchmarks spanning 30 datasets, using linear mixed-effects models to estimate marginal mean performance and disentangle intrinsic model capability from dataset heterogeneity. Our analysis reveals that while instruction-tuned models generally excel, those optimized for complex reasoning often suffer a "specialization tax," exhibiting limited generalizability in broader contexts. For stability, we assess model resilience against both unintentional query variations (e.g., paraphrasing, typos) and malicious adversarial attacks (e.g., corpus poisoning). We find that LLM-based retrievers show improved robustness against typos and corpus poisoning compared to encoder-only baselines, yet remain vulnerable to semantic perturbations like synonymizing. Further analysis shows that embedding geometry (e.g., angular uniformity) provides predictive signals for lexical stability and suggests that scaling model size generally improves robustness. These findings inform future robustness-aware retriever design and principled benchmarking. Our code is publicly available at https://github.com/liyongkang123/Robust_LLM_Retriever_Eval.
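To illustrate the idea of separating intrinsic model capability from dataset heterogeneity, the sketch below uses per-dataset centering as a simplified stand-in for the paper's linear mixed-effects analysis (which would treat the dataset as a random effect). All model names, effect sizes, and scores are synthetic, not results from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_datasets = 3, 30

# Synthetic ground truth: per-model capability and per-dataset difficulty.
model_effect = np.array([0.00, 0.05, 0.03])           # hypothetical models
dataset_effect = 0.1 * np.sin(np.arange(n_datasets))  # dataset heterogeneity

# Observed nDCG-like scores: capability + difficulty + noise.
scores = (0.4 + model_effect[:, None] + dataset_effect[None, :]
          + rng.normal(0, 0.02, size=(n_models, n_datasets)))

# Centering each dataset column removes the shared dataset effect;
# averaging the centered rows estimates each model's marginal offset.
centered = scores - scores.mean(axis=0, keepdims=True)
marginal = centered.mean(axis=1)
print(marginal)  # recovers model_effect up to a shared constant and noise
```

Because every dataset column is centered, the dataset terms cancel exactly, and the recovered offsets differ from `model_effect` only by a common shift and averaged noise. A mixed-effects model generalizes this to unbalanced designs and yields uncertainty estimates for the marginal means.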
April 22, 2026