Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL

April 18, 2026
Authors: Skylar Zhai, Jingcheng Liang, Dongyeop Kang
cs.AI

Abstract

Reinforcement fine-tuning improves the reasoning ability of large language models, but it can also encourage them to answer unanswerable queries by guessing or hallucinating missing information. Existing abstention methods either train models to produce generic refusals or encourage follow-up clarifications without verifying whether those clarifications identify the key missing information. We study queries that are clear in meaning but cannot be reliably resolved from the given information, and argue that a reliable model should not only abstain, but also explain what is missing. We propose a clarification-aware RLVR reward that, while rewarding correct answers on answerable queries, jointly optimizes explicit abstention and semantically aligned post-refusal clarification on unanswerable queries. Using this reward, we train Abstain-R1, a 3B model that improves abstention and clarification on unanswerable queries while preserving strong performance on answerable ones. Experiments on Abstain-Test, Abstain-QA, and SelfAware show that Abstain-R1 substantially improves over its base model and achieves unanswerable-query behavior competitive with larger systems including DeepSeek-R1, suggesting that calibrated abstention and clarification can be learned through verifiable rewards rather than emerging from scale alone.
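The abstract states the reward's goals but not its exact form. Below is a minimal sketch of how such a clarification-aware RLVR reward could be computed, assuming a simple substring check for answer correctness, keyword-based abstention detection, a hypothetical `clarification_scorer` for semantic alignment, a 0.7 alignment threshold, and a 0.5/0.5 credit split between abstaining and clarifying; none of these specifics come from the paper.

```python
# A minimal sketch (not the paper's implementation) of a clarification-aware
# RLVR reward. Marker strings, the similarity scorer, the 0.7 threshold, and
# the 0.5/0.5 credit split are all illustrative assumptions.
from typing import Callable, Optional


def clarification_aware_reward(
    response: str,
    is_answerable: bool,
    gold_answer: Optional[str] = None,
    gold_missing_info: Optional[str] = None,
    clarification_scorer: Optional[Callable[[str, str], float]] = None,
    sim_threshold: float = 0.7,
) -> float:
    """Reward correct answers on answerable queries; on unanswerable ones,
    reward explicit abstention plus a clarification that names the key
    missing information."""
    if is_answerable:
        # Verifiable correctness check; exact substring match is a common
        # RLVR choice, though the paper's verifier may differ.
        return 1.0 if gold_answer and gold_answer.lower() in response.lower() else 0.0

    # Unanswerable query: the model must first abstain explicitly.
    abstain_markers = ("cannot answer", "unanswerable", "not enough information")
    if not any(m in response.lower() for m in abstain_markers):
        return 0.0  # guessed or hallucinated instead of abstaining

    # Post-refusal clarification: check semantic alignment between the
    # response and the gold missing information (e.g., via embeddings).
    if clarification_scorer is None or gold_missing_info is None:
        return 0.5  # abstention alone earns partial credit (assumed split)
    aligned = clarification_scorer(response, gold_missing_info) >= sim_threshold
    return 1.0 if aligned else 0.5
```

The scorer here could be any sentence-embedding cosine similarity; the property the abstract emphasizes is that the clarification is verified against the actual missing information, rather than the model being rewarded for merely asking any follow-up question.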