TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

Abstract

While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Truthfulness requires more than accuracy: models must also recognize uncertainty and abstain when unsure, so that they avoid hallucinating. This presents a fundamental challenge for existing methods. Approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative and sacrifice correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. This reward incentivizes models to reduce hallucinations not only by providing correct responses, but also by abstaining when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that, compared to vanilla RL, TruthRL reduces hallucinations by 28.9% and improves truthfulness by 21.1%, with consistent gains across various backbone models (e.g., Qwen, Llama) under both retrieval and non-retrieval setups. An in-depth ablation study demonstrates that vanilla accuracy-driven methods, such as supervised fine-tuning or RL with a binary reward, struggle to balance factual correctness and uncertainty. In contrast, the proposed truthfulness-driven TruthRL achieves strong performance in both accuracy and truthfulness, underscoring the importance of learning objective design for developing truthful LLMs.
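The contrast between a binary and a ternary reward under GRPO can be illustrated with a minimal sketch. The numeric reward values (+1, 0, -1), the outcome labels, and the helper names below are assumptions chosen for illustration; the abstract states only that the reward distinguishes correct answers, hallucinations, and abstentions. The advantage computation follows the commonly used GRPO formulation (reward minus group mean, divided by group standard deviation) and is not necessarily the authors' exact implementation.

```python
from statistics import mean, pstdev

# Assumed reward values: the abstract only says the reward is ternary.
TERNARY = {"correct": 1.0, "abstain": 0.0, "hallucination": -1.0}


def ternary_reward(outcome: str) -> float:
    """Reward that separates abstention from hallucination (assumed values)."""
    return TERNARY[outcome]


def binary_reward(outcome: str) -> float:
    """Accuracy-only reward: anything that is not correct scores zero."""
    return 1.0 if outcome == "correct" else 0.0


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: normalize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this
    return [(r - mu) / (sigma + eps) for r in rewards]


# One hypothetical group of four sampled responses to the same question.
group = ["correct", "abstain", "hallucination", "hallucination"]

adv_ternary = group_advantages([ternary_reward(o) for o in group])
adv_binary = group_advantages([binary_reward(o) for o in group])

for outcome, a_t, a_b in zip(group, adv_ternary, adv_binary):
    print(f"{outcome:14s} ternary={a_t:+.3f} binary={a_b:+.3f}")
```

In this sketch the binary reward gives the abstention and the hallucinations the same advantage, so nothing discourages guessing. The ternary reward ranks abstention above hallucination and below a correct answer, which matches the behavior the abstract describes.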
