Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL

Abstract

Reinforcement fine-tuning improves the reasoning ability of large language models, but it can also encourage them to answer unanswerable queries by guessing or hallucinating missing information. Existing abstention methods either train models to produce generic refusals or encourage follow-up clarifications without verifying whether those clarifications identify the key missing information. We study queries that are clear in meaning but cannot be reliably resolved from the given information, and argue that a reliable model should not only abstain, but also explain what is missing. We propose a clarification-aware RLVR reward that, while rewarding correct answers on answerable queries, jointly optimizes explicit abstention and semantically aligned post-refusal clarification on unanswerable queries. Using this reward, we train Abstain-R1, a 3B model that improves abstention and clarification on unanswerable queries while preserving strong performance on answerable ones. Experiments on Abstain-Test, Abstain-QA, and SelfAware show that Abstain-R1 substantially improves over its base model and achieves unanswerable-query behavior competitive with larger systems including DeepSeek-R1, suggesting that calibrated abstention and clarification can be learned through verifiable rewards rather than emerging from scale alone.
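The reward described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstention phrase, the 0.5/0.5 weighting, the substring answer check, and the overlap threshold are all hypothetical, and a crude token-overlap score stands in for whatever semantic-alignment verifier the paper actually uses.

```python
import re

# Hypothetical abstention phrase; the marker used in the paper may differ.
ABSTAIN_MARKER = "i don't know"


def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def clarification_overlap(clarification: str, reference: str) -> float:
    """Jaccard token overlap: a crude stand-in for a semantic-alignment check."""
    a, b = _tokens(clarification), _tokens(reference)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def reward(response, answerable, gold_answer=None,
           reference_clarification=None, threshold=0.5):
    """Toy clarification-aware reward (all weights are assumptions)."""
    text = response.strip().lower()
    abstained = ABSTAIN_MARKER in text

    if answerable:
        # Answerable query: reward a correct answer, never an abstention.
        if abstained or gold_answer is None:
            return 0.0
        return 1.0 if gold_answer.strip().lower() in text else 0.0

    # Unanswerable query: guessing earns nothing.
    if not abstained:
        return 0.0

    score = 0.5  # partial credit for explicit abstention (assumed weight)
    clarification = text.replace(ABSTAIN_MARKER, "", 1)
    if reference_clarification is not None and (
        clarification_overlap(clarification, reference_clarification) >= threshold
    ):
        score += 0.5  # remaining credit for naming the missing information
    return score
```

Under these assumptions, a bare refusal on an unanswerable query earns partial credit, a refusal that also names the missing information earns full credit, and a confident guess earns none. This mirrors the joint optimization of abstention and clarification that the abstract describes.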
