

No One Size Fits All: QueryBandits for Hallucination Mitigation

February 23, 2026
Authors: Nicole Cho, William Watson, Alec Koppel, Sumitra Ganesh, Manuela Veloso
cs.AI

Abstract

Advanced reasoning capabilities in Large Language Models (LLMs) have led to more frequent hallucinations; yet most mitigation work focuses on open-source models for post-hoc detection and parameter editing. The dearth of studies focusing on hallucinations in closed-source models is especially concerning, as they constitute the vast majority of models in institutional deployments. We introduce QueryBandits, a model-agnostic contextual bandit framework that adaptively learns online to select the optimal query-rewrite strategy by leveraging an empirically validated and calibrated reward function. Across 16 QA scenarios, our top QueryBandit (Thompson Sampling) achieves an 87.5% win rate over a No-Rewrite baseline and outperforms zero-shot static policies (e.g., Paraphrase or Expand) by 42.6% and 60.3%, respectively. Moreover, all contextual bandits outperform vanilla bandits across all datasets, with higher feature variance coinciding with greater variance in arm selection. This substantiates our finding that there is no single rewrite policy optimal for all queries. We also discover that certain static policies incur higher cumulative regret than No-Rewrite, indicating that an inflexible query-rewriting policy can worsen hallucinations. Thus, learning an online policy over semantic features with QueryBandits can shift model behavior purely through forward-pass mechanisms, enabling its use with closed-source models and bypassing the need for retraining or gradient-based adaptation.
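To make the abstract's core mechanism concrete, here is a minimal sketch of a contextual Thompson Sampling bandit choosing among query-rewrite arms. This is an illustration under stated assumptions, not the paper's implementation: the arm names, the linear-Gaussian reward model, and the feature vector are all hypothetical stand-ins for QueryBandits' actual rewrite taxonomy, calibrated reward function, and semantic features.

```python
import numpy as np

# Hypothetical rewrite strategies ("arms"); names are illustrative,
# not the paper's exact taxonomy.
ARMS = ["no_rewrite", "paraphrase", "expand", "decompose"]

class LinearThompsonBandit:
    """Contextual Thompson Sampling with a Bayesian linear model per arm.

    Each arm keeps a Gaussian posterior over reward weights. At each
    step we sample weights from every arm's posterior, score the
    query's feature vector, and pull the arm with the highest sampled
    reward. Posteriors are updated online from observed rewards, so no
    gradient access to the underlying LLM is needed.
    """

    def __init__(self, n_arms, n_features, noise=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.noise = noise
        # Per-arm posterior precision (A) and reward-weighted feature sums (b).
        self.A = [np.eye(n_features) for _ in range(n_arms)]
        self.b = [np.zeros(n_features) for _ in range(n_arms)]

    def select(self, x):
        """Sample weights per arm and return the index of the best arm."""
        scores = []
        for A, b in zip(self.A, self.b):
            cov = np.linalg.inv(A)       # posterior covariance
            mean = cov @ b               # posterior mean of the weights
            w = self.rng.multivariate_normal(mean, self.noise * cov)
            scores.append(w @ x)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Conjugate Bayesian linear-regression posterior update."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Usage: a toy loop where only "no_rewrite" (arm 0) is rewarded for
# this feature vector; the bandit concentrates on it over time.
bandit = LinearThompsonBandit(n_arms=len(ARMS), n_features=2, seed=42)
x = np.array([1.0, 0.0])  # placeholder semantic features of a query
picks = []
for _ in range(300):
    arm = bandit.select(x)
    reward = 1.0 if arm == 0 else 0.0  # stand-in for the calibrated reward
    bandit.update(arm, x, reward)
    picks.append(arm)
```

Because selection runs purely on forward-pass observations (query features in, answer-quality reward out), this style of policy is compatible with closed-source models, which is the point the abstract makes.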