On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability

Abstract

Decoder-only large language models (LLMs) are increasingly replacing BERT-style architectures as the backbone for dense retrieval, achieving substantial performance gains and broad adoption. However, the robustness of these LLM-based retrievers remains underexplored. In this paper, we present the first systematic study of the robustness of state-of-the-art open-source LLM-based dense retrievers from two complementary perspectives: generalizability and stability. For generalizability, we evaluate retrieval effectiveness across four benchmarks spanning 30 datasets, using linear mixed-effects models to estimate marginal mean performance and to disentangle intrinsic model capability from dataset heterogeneity. Our analysis reveals that while instruction-tuned models generally excel, those optimized for complex reasoning often pay a "specialization tax," exhibiting limited generalizability in broader contexts. For stability, we assess model resilience against both unintentional query variations (e.g., paraphrasing, typos) and malicious adversarial attacks (e.g., corpus poisoning). We find that LLM-based retrievers are more robust to typos and corpus poisoning than encoder-only baselines, yet remain vulnerable to semantic perturbations such as synonym substitution. Further analysis shows that embedding geometry (e.g., angular uniformity) provides predictive signals for lexical stability, and suggests that scaling model size generally improves robustness. These findings inform future robustness-aware retriever design and principled benchmarking. Our code is publicly available at https://github.com/liyongkang123/Robust_LLM_Retriever_Eval.
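The abstract does not define "angular uniformity," so the following is only an illustrative sketch of one way embedding geometry can be quantified. It uses the uniformity measure from contrastive representation learning (Wang and Isola), which may differ from the metric used in this work. The function name `uniformity`, the temperature `t = 2.0`, and the synthetic embeddings are assumptions made for the example.

```python
import numpy as np

def uniformity(embeddings: np.ndarray, t: float = 2.0) -> float:
    """Log of the mean Gaussian potential over all distinct pairs.

    Lower values mean directions are spread more evenly over the hypersphere.
    Illustrative only; not necessarily the metric used by the authors.
    """
    # L2-normalize so that only direction (angle) matters.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # For unit vectors, ||u - v||^2 = 2 - 2 * cos(u, v).
    sq_dists = 2.0 - 2.0 * (x @ x.T)
    # Keep each unordered pair once (strict upper triangle).
    iu = np.triu_indices(len(x), k=1)
    return float(np.log(np.mean(np.exp(-t * sq_dists[iu]))))

# Synthetic stand-ins for query embeddings (hypothetical data).
rng = np.random.default_rng(0)
spread = rng.normal(size=(256, 64))        # roughly isotropic directions
clustered = 0.05 * spread + np.ones(64)    # directions bunched around one axis

score_spread = uniformity(spread)
score_clustered = uniformity(clustered)
```

Under this measure the isotropic set scores lower (more uniform) than the clustered one. Relating such a score to lexical stability would require the paper's actual embeddings and perturbation results.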
