Learning to Reason for Factuality

August 7, 2025
Authors: Xilun Chen, Ilia Kulikov, Vincent-Pierre Berges, Barlas Oğuz, Rulin Shao, Gargi Ghosh, Jason Weston, Wen-tau Yih
cs.AI

Abstract

Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning counterparts on long-form factuality benchmarks. However, extending online Reinforcement Learning (RL), a key component in recent R-LLM advancements, to the long-form factuality setting poses several unique challenges due to the lack of reliable verification methods. Previous work has utilized automatic factuality evaluation frameworks such as FActScore to curate preference data in the offline RL setting, yet we find that directly leveraging such methods as the reward in online RL leads to reward hacking in multiple ways, such as producing less detailed or relevant responses. We propose a novel reward function that simultaneously considers factual precision, response detail level, and answer relevance, and apply online RL to learn high-quality factual reasoning. Evaluated on six long-form factuality benchmarks, our factual reasoning model achieves an average reduction of 23.1 percentage points in hallucination rate, a 23% increase in answer detail level, and no degradation in overall response helpfulness.
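The abstract names the three reward components (factual precision, response detail level, and answer relevance) but not their functional form. A minimal sketch of such a composite reward, assuming FActScore-style counts of supported and unsupported claims plus a judge-provided relevance score, is shown below; the names `num_supported`, `num_unsupported`, `relevance_score`, and `target_detail` are hypothetical and not taken from the paper.

```python
def factuality_reward(
    num_supported: int,
    num_unsupported: int,
    relevance_score: float,
    target_detail: int = 20,
) -> float:
    """Hypothetical composite reward combining factual precision,
    response detail level, and answer relevance.

    The exact formulation used in the paper is not given in the abstract;
    this sketch only illustrates the three components it names.
    """
    total_claims = num_supported + num_unsupported
    if total_claims == 0:
        return 0.0  # no verifiable claims -> no reward

    # Factual precision: fraction of extracted claims verified as supported.
    precision = num_supported / total_claims

    # Detail level: reward more supported claims, saturating at a target count
    # so the policy cannot inflate reward indefinitely with trivial facts.
    detail = min(num_supported / target_detail, 1.0)

    # Relevance: e.g. a judge-model score clamped to [0, 1] for staying on
    # topic, discouraging off-topic but easily verifiable statements.
    relevance = max(0.0, min(relevance_score, 1.0))

    # Combine multiplicatively so gaming any single term is insufficient.
    return precision * detail * relevance
```

Combining the terms multiplicatively is one plausible way to discourage hacking any single component, such as producing sparse or off-topic answers that score well on precision alone, which is the failure mode the abstract attributes to using a factuality metric like FActScore as the sole online RL reward.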