Learning to Reason for Hallucination Span Detection

Abstract

Large language models (LLMs) often generate hallucinations, that is, unsupported content that undermines reliability. While most prior work frames hallucination detection as a binary task, many real-world applications require identifying the hallucinated spans themselves, which is a multi-step decision-making process. This naturally raises the question of whether explicit reasoning can help with the complex task of detecting hallucination spans. To answer this question, we first evaluate pretrained models with and without Chain-of-Thought (CoT) reasoning, and show that CoT reasoning has the potential to generate at least one correct answer when sampled multiple times. Motivated by this, we propose RL4HS, a reinforcement learning framework that incentivizes reasoning with a span-level reward function. RL4HS builds on Group Relative Policy Optimization and introduces Class-Aware Policy Optimization to mitigate the reward imbalance issue. Experiments on the RAGTruth benchmark (summarization, question answering, data-to-text) show that RL4HS surpasses pretrained reasoning models and supervised fine-tuning, demonstrating the necessity of reinforcement learning with span-level rewards for detecting hallucination spans.
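
To make the idea of a span-level reward concrete, the following is a minimal sketch and not the authors' implementation. It assumes the reward is a character-overlap F1 between predicted and annotated spans, and that rewards are normalized within a group of sampled responses in the style of Group Relative Policy Optimization. The names `span_f1` and `group_relative_advantages`, the half-open span convention, and the use of the population standard deviation are all illustrative assumptions. Class-Aware Policy Optimization is not shown, since the abstract does not describe its details.

```python
from statistics import mean, pstdev
from typing import List, Tuple

# Assumption: a span is a half-open character range [start, end).
Span = Tuple[int, int]


def covered(spans: List[Span]) -> set:
    """Return the set of character positions covered by any span."""
    return {i for start, end in spans for i in range(start, end)}


def span_f1(predicted: List[Span], gold: List[Span]) -> float:
    """Character-overlap F1 between predicted and gold spans (assumed reward)."""
    pred_chars, gold_chars = covered(predicted), covered(gold)
    if not pred_chars and not gold_chars:
        return 1.0  # both agree that nothing is hallucinated
    overlap = len(pred_chars & gold_chars)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Center and scale rewards within one group of sampled responses."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: one annotated span and three sampled predictions.
gold = [(10, 20)]
group = [[(10, 20)], [(15, 25)], []]
rewards = [span_f1(pred, gold) for pred in group]  # [1.0, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
```

In this toy group, the exact match receives a positive advantage, the partial overlap sits at the group mean, and the empty prediction is penalized. A graded span-level signal of this kind is one way to separate partially correct detections from misses, which a binary reward would not do.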
