

Learning to Reason for Factuality

August 7, 2025
Authors: Xilun Chen, Ilia Kulikov, Vincent-Pierre Berges, Barlas Oğuz, Rulin Shao, Gargi Ghosh, Jason Weston, Wen-tau Yih
cs.AI

Abstract

Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning counterparts on long-form factuality benchmarks. However, extending online Reinforcement Learning (RL), a key component in recent R-LLM advancements, to the long-form factuality setting poses several unique challenges due to the lack of reliable verification methods. Previous work has utilized automatic factuality evaluation frameworks such as FActScore to curate preference data in the offline RL setting, yet we find that directly leveraging such methods as the reward in online RL leads to reward hacking in multiple ways, such as producing less detailed or less relevant responses. We propose a novel reward function that simultaneously considers factual precision, response detail level, and answer relevance, and apply online RL to learn high-quality factual reasoning. Evaluated on six long-form factuality benchmarks, our factual reasoning model achieves an average reduction of 23.1 percentage points in hallucination rate, a 23% increase in answer detail level, and no degradation in overall response helpfulness.
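The abstract does not spell out how the three signals are combined, but the design intuition is that rewarding factual precision alone invites reward hacking (fewer, vaguer claims), so detail and relevance terms must pull in the opposite direction. The following is a minimal, hypothetical sketch of such a composite reward; the component definitions, weights, and the `FactualityScores` container are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical composite factuality reward for online RL (illustrative only;
# the paper's exact reward function is not specified in the abstract).
from dataclasses import dataclass


@dataclass
class FactualityScores:
    num_supported_claims: int  # atomic claims verified by a FActScore-style judge (assumed)
    num_total_claims: int      # all atomic claims extracted from the response (assumed)
    relevance: float           # judge-assigned on-topic score in [0, 1] (assumed)


def composite_reward(scores: FactualityScores,
                     detail_weight: float = 0.3,
                     relevance_weight: float = 0.3,
                     target_claims: int = 30) -> float:
    """Combine factual precision, detail level, and relevance into one scalar.

    Precision alone is hackable: a policy can emit fewer or vaguer claims.
    Adding a capped detail term and a relevance term counteracts that.
    """
    if scores.num_total_claims == 0:
        return 0.0  # empty or claim-free responses earn nothing
    precision = scores.num_supported_claims / scores.num_total_claims
    detail = min(scores.num_total_claims, target_claims) / target_claims
    return precision + detail_weight * detail + relevance_weight * scores.relevance
```

Under this sketch, a terse response with two trivially true claims scores lower than a detailed, mostly accurate one, which is the kind of incentive the paper's reward is described as providing.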