Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL

Abstract

Reinforcement fine-tuning improves the reasoning ability of large language models, but it can also encourage them to answer unanswerable queries by guessing or hallucinating missing information. Existing abstention methods either train models to produce generic refusals or encourage follow-up clarifications without verifying whether those clarifications identify the key missing information. We study queries that are clear in meaning but cannot be reliably resolved from the given information, and argue that a reliable model should not only abstain, but also explain what is missing. We propose a clarification-aware RLVR reward that, while rewarding correct answers on answerable queries, jointly optimizes explicit abstention and semantically aligned post-refusal clarification on unanswerable queries. Using this reward, we train Abstain-R1, a 3B-parameter model that improves abstention and clarification on unanswerable queries while preserving strong performance on answerable ones. Experiments on Abstain-Test, Abstain-QA, and SelfAware show that Abstain-R1 substantially improves over its base model and achieves unanswerable-query behavior competitive with larger systems including DeepSeek-R1, suggesting that calibrated abstention and clarification can be learned through verifiable rewards rather than emerging from scale alone.
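As a rough illustration of the reward structure described above, the following Python sketch scores a response differently depending on whether the query is answerable. It is not the paper's implementation: the abstention marker string, the equal weights, the 0.5 threshold, the `Example` type, and the token-overlap similarity (a simple stand-in for whatever semantic verifier the authors use) are all assumptions made for illustration only.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical abstention marker; the abstract does not specify the actual format.
ABSTAIN_MARKER = "i don't know"


@dataclass
class Example:
    answerable: bool
    gold_answer: Optional[str] = None    # reference answer (answerable queries)
    missing_info: Optional[str] = None   # reference note on what is missing (unanswerable queries)


def tokens(text: str) -> list:
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of token sets; a crude stand-in for a semantic verifier."""
    ta, tb = set(tokens(a)), set(tokens(b))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def reward(
    response: str,
    example: Example,
    similarity: Callable[[str, str], float] = token_overlap,
    threshold: float = 0.5,   # assumed value
    w_abstain: float = 0.5,   # assumed weight
    w_clarify: float = 0.5,   # assumed weight
) -> float:
    marker_pos = response.lower().find(ABSTAIN_MARKER)
    abstained = marker_pos != -1

    if example.answerable:
        # Answerable query: reward only a correct answer (normalized exact match).
        return 1.0 if tokens(response) == tokens(example.gold_answer or "") else 0.0

    if not abstained:
        # Unanswerable query answered by guessing: no reward.
        return 0.0

    # Unanswerable query: credit the explicit abstention, plus a bonus when the
    # text after the marker matches the reference description of what is missing.
    clarification = response[marker_pos + len(ABSTAIN_MARKER):]
    aligned = similarity(clarification, example.missing_info or "") >= threshold
    return w_abstain + (w_clarify if aligned else 0.0)
```

Under these assumptions, a bare refusal on an unanswerable query earns only the abstention share of the reward, while a refusal that also names the missing information earns the full reward. A guess earns nothing.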
