TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning
September 30, 2025
作者: Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Sean Chen, Mohammad Kachuee, Teja Gollapudi, Tony Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, Xin Luna Dong
cs.AI
Abstract
While large language models (LLMs) have demonstrated strong performance on
factoid question answering, they are still prone to hallucination and
untruthful responses, particularly when tasks demand information outside their
parametric knowledge. Indeed, truthfulness requires more than accuracy --
models must also recognize uncertainty and abstain when unsure to avoid
hallucinations. This presents a fundamental challenge for existing methods:
approaches that optimize for accuracy often amplify hallucinations, while those
that encourage abstention can become overly conservative, sacrificing correct
answers. Both extremes ultimately compromise truthfulness. In this work, we
present TruthRL, a general reinforcement learning (RL) framework that directly
optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using
GRPO with a simple yet effective ternary reward that distinguishes correct
answers, hallucinations, and abstentions. It incentivizes models to reduce
hallucinations not only by providing correct responses, but also by enabling
abstention when uncertain, thereby improving truthfulness. Extensive
experiments across four knowledge-intensive benchmarks show that, compared to
vanilla RL, TruthRL significantly reduces hallucinations by 28.9% and improves
truthfulness by 21.1%, with consistent gains across various backbone models
(e.g., Qwen, Llama) under both retrieval and non-retrieval setups. An in-depth
ablation study demonstrates that vanilla accuracy-driven methods, such as
supervised fine-tuning or RL with a binary reward, struggle to balance factual
correctness and uncertainty. In contrast, our proposed truthfulness-driven
TruthRL achieves strong performance in both accuracy and truthfulness,
underscoring the importance of learning objective design for developing
truthful LLMs.
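
For intuition, below is a minimal sketch of the ternary reward idea described above. It assumes a +1 / 0 / -1 scheme and an external judge that labels each rollout as correct, abstaining, or hallucinated; the exact reward values and judging procedure are not specified in the abstract, so this is an illustration rather than the paper's implementation.

```python
from enum import Enum


class Outcome(Enum):
    """Judged outcome of a single rollout (the abstract does not say how
    judging is done; an external verifier is assumed here)."""
    CORRECT = "correct"
    ABSTAIN = "abstain"
    HALLUCINATION = "hallucination"


def ternary_reward(outcome: Outcome) -> float:
    """Map a judged outcome to a scalar reward for GRPO-style policy optimization.
    The +1 / 0 / -1 values are illustrative assumptions, not the paper's exact numbers."""
    if outcome is Outcome.CORRECT:
        return 1.0   # reward factually correct answers
    if outcome is Outcome.ABSTAIN:
        return 0.0   # neutral: abstaining avoids hallucination but earns no credit
    return -1.0      # penalize hallucinated (confidently wrong) answers
```

Relative to a binary correct/incorrect reward, the intermediate value for abstention is what lets the policy prefer "I don't know" over a confident wrong answer when it is uncertain, which is the behavior the abstract credits for the reduction in hallucinations.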