

TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

September 30, 2025
作者: Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Sean Chen, Mohammad Kachuee, Teja Gollapudi, Tony Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, Xin Luna Dong
cs.AI

Abstract

While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy -- models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that, compared to vanilla RL, TruthRL significantly reduces hallucinations by 28.9% and improves truthfulness by 21.1%, with consistent gains across various backbone models (e.g., Qwen, Llama) under both retrieval and non-retrieval setups. In-depth ablation studies demonstrate that vanilla accuracy-driven methods, such as supervised fine-tuning or RL with a binary reward, struggle to balance factual correctness and uncertainty. In contrast, our proposed truthfulness-driven TruthRL achieves strong performance in both accuracy and truthfulness, underscoring the importance of learning objective design for developing truthful LLMs.
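To make the ternary reward concrete, below is a minimal sketch in Python of how such a reward could be scored, assuming reward values of +1 for a correct answer, 0 for an abstention, and -1 for a hallucination; the exact values and the judging procedure are not specified in this abstract, and the helpers `is_abstention` and `matches_gold` are illustrative stand-ins, not functions from the paper.

```python
# Hypothetical sketch of a ternary reward distinguishing correct answers,
# abstentions, and hallucinations. Reward values and judging logic are
# assumptions for illustration, not the paper's exact specification.

def ternary_reward(response: str, gold_answer: str) -> float:
    """Score a model response against a gold answer with a three-way reward."""
    if is_abstention(response):          # model declines to answer
        return 0.0                       # abstention: neither rewarded nor punished
    if matches_gold(response, gold_answer):
        return 1.0                       # correct answer: positive reward
    return -1.0                          # confident but wrong: treated as hallucination


def is_abstention(response: str) -> bool:
    # Simplistic keyword check; a real judge could be an LLM grader or rule set.
    phrases = ("i don't know", "cannot answer", "not sure", "unable to determine")
    return any(p in response.lower() for p in phrases)


def matches_gold(response: str, gold_answer: str) -> bool:
    # Simplistic containment match; the paper may use exact-match or LLM-based judging.
    return gold_answer.lower() in response.lower()
```

Within GRPO, per-response rewards like these would be normalized across each sampled group to form advantages, so abstaining becomes relatively attractive only when the model is unlikely to produce a correct answer, which is consistent with the accuracy/abstention trade-off the abstract describes.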