Learning to Reason for Factuality

Abstract

Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning counterparts on long-form factuality benchmarks. However, extending online Reinforcement Learning (RL), a key component in recent R-LLM advancements, to the long-form factuality setting poses several unique challenges due to the lack of reliable verification methods. Previous work has utilized automatic factuality evaluation frameworks such as FActScore to curate preference data in the offline RL setting, yet we find that directly using such methods as the reward in online RL leads to reward hacking in multiple ways, such as producing less detailed or less relevant responses. We propose a novel reward function that simultaneously considers factual precision, response detail level, and answer relevance, and we apply online RL to learn high-quality factual reasoning. Evaluated on six long-form factuality benchmarks, our factual reasoning model achieves an average reduction of 23.1 percentage points in hallucination rate and a 23% increase in answer detail level, with no degradation in overall response helpfulness.
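To illustrate why a precision-only reward can be hacked, and how adding detail and relevance terms can counter that, here is a minimal Python sketch. It is not the authors' reward function: the `JudgedResponse` fields, the additive form, the log-scaled detail term, and the weights `detail_weight` and `relevance_bonus` are all assumptions chosen for illustration. The claim counts and the relevance flag are assumed to come from some external verifier and judge.

```python
import math
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    """Hypothetical container for verifier/judge outputs on one response."""
    supported_claims: int  # atomic claims the verifier marked as supported
    total_claims: int      # all atomic claims extracted from the response
    is_relevant: bool      # whether a judge deemed the response on-topic


def precision_only_reward(r: JudgedResponse) -> float:
    """Factual precision alone: fraction of claims that are supported."""
    if r.total_claims == 0:
        return 0.0
    return r.supported_claims / r.total_claims


def combined_reward(
    r: JudgedResponse,
    detail_weight: float = 0.1,    # assumed weight, not from the paper
    relevance_bonus: float = 0.5,  # assumed bonus, not from the paper
) -> float:
    """Illustrative reward mixing precision, detail level, and relevance."""
    if r.total_claims == 0:
        return 0.0  # a response with no checkable claims earns nothing
    precision = r.supported_claims / r.total_claims
    # Log scaling gives diminishing returns for ever-longer answers.
    detail = math.log1p(r.supported_claims)
    relevance = relevance_bonus if r.is_relevant else 0.0
    return precision + detail_weight * detail + relevance


# A terse answer with one safe claim vs. a detailed, mostly correct answer.
terse = JudgedResponse(supported_claims=1, total_claims=1, is_relevant=True)
detailed = JudgedResponse(supported_claims=18, total_claims=20, is_relevant=True)
# A precise answer that does not address the question.
off_topic = JudgedResponse(supported_claims=10, total_claims=10, is_relevant=False)

for name, resp in [("terse", terse), ("detailed", detailed), ("off_topic", off_topic)]:
    print(name, round(precision_only_reward(resp), 3), round(combined_reward(resp), 3))
```

Under these assumed weights, the precision-only score prefers the terse and off-topic responses over the detailed one, while the combined score ranks the detailed, relevant response highest. This is the kind of reward hacking the abstract describes.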
