No One Size Fits All: QueryBandits for Hallucination Mitigation

Abstract

Advanced reasoning capabilities in Large Language Models (LLMs) have led to more frequent hallucinations, yet most mitigation work focuses on post-hoc detection and parameter editing for open-source models. The dearth of studies on hallucinations in closed-source models is especially concerning, as these models constitute the vast majority of institutional deployments. We introduce QueryBandits, a model-agnostic contextual bandit framework that adaptively learns online to select the optimal query-rewrite strategy by leveraging an empirically validated and calibrated reward function. Across 16 QA scenarios, our top QueryBandit (Thompson Sampling) achieves an 87.5% win rate over a No-Rewrite baseline and outperforms zero-shot static policies (e.g., Paraphrase or Expand) by 42.6% and 60.3%, respectively. Moreover, all contextual bandits outperform vanilla bandits across all datasets, with higher feature variance coinciding with greater variance in arm selection. This substantiates our finding that no single rewrite policy is optimal for all queries. We also find that certain static policies incur higher cumulative regret than No-Rewrite, indicating that an inflexible query-rewriting policy can worsen hallucinations. Thus, learning an online policy over semantic features with QueryBandits can shift model behavior purely through forward-pass mechanisms, which enables its use with closed-source models and bypasses the need for retraining or gradient-based adaptation.
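To make the idea of a contextual bandit over rewrite strategies concrete, the following is a minimal sketch of linear Thompson Sampling, not the authors' implementation. It assumes one Bayesian linear model per arm, a hypothetical fixed-length feature vector describing each query, and a scalar reward supplied by the caller. The arm names `no_rewrite`, `paraphrase`, and `expand` echo the policies named in the abstract; the class name, feature layout, prior, and noise scale are all illustrative assumptions.

```python
import numpy as np


class LinearThompsonSampling:
    """Illustrative contextual bandit: one Bayesian linear model per arm.

    All names and hyperparameters here are assumptions made for this
    sketch; they are not taken from the QueryBandits paper.
    """

    def __init__(self, arms, dim, prior_precision=1.0, noise_scale=1.0, seed=0):
        self.arms = list(arms)
        self.noise_scale = noise_scale
        self.rng = np.random.default_rng(seed)
        # Per arm: A = prior_precision * I + sum(x x^T), b = sum(reward * x).
        self.A = {a: prior_precision * np.eye(dim) for a in self.arms}
        self.b = {a: np.zeros(dim) for a in self.arms}

    def select(self, x):
        """Sample weights from each arm's posterior and pick the best arm."""
        scores = {}
        for a in self.arms:
            cov = np.linalg.inv(self.A[a])
            cov = (cov + cov.T) / 2.0  # guard against tiny asymmetries
            mean = cov @ self.b[a]
            theta = self.rng.multivariate_normal(mean, self.noise_scale**2 * cov)
            scores[a] = float(theta @ x)
        return max(scores, key=scores.get)

    def update(self, arm, x, reward):
        """Fold one observed (features, reward) pair into the chosen arm."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


if __name__ == "__main__":
    # Hypothetical two-dimensional query features and a made-up reward.
    bandit = LinearThompsonSampling(["no_rewrite", "paraphrase", "expand"], dim=2)
    features = np.array([1.0, 0.0])
    arm = bandit.select(features)
    bandit.update(arm, features, reward=1.0)
    print("chosen rewrite arm:", arm)
```

Because selection and updating use only query features and an observed reward, a loop of this kind needs nothing beyond forward calls to the underlying LLM, which is the property the abstract highlights for closed-source models.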
