The Hallucination Tax of Reinforcement Finetuning

Abstract

Reinforcement finetuning (RFT) has become a standard approach for enhancing the reasoning capabilities of large language models (LLMs). However, its impact on model trustworthiness remains underexplored. In this work, we identify and systematically study a critical side effect of RFT, which we term the hallucination tax: a degradation in refusal behavior that causes models to confidently produce hallucinated answers to unanswerable questions. To investigate this, we introduce SUM (Synthetic Unanswerable Math), a high-quality dataset of unanswerable math problems designed to probe a model's ability to recognize that a question is unanswerable by reasoning from insufficient or ambiguous information. Our results show that standard RFT training can reduce model refusal rates by more than 80%, which significantly increases the model's tendency to hallucinate. We further demonstrate that incorporating just 10% SUM data during RFT substantially restores appropriate refusal behavior, with minimal accuracy trade-offs on solvable tasks. Crucially, this approach enables LLMs to leverage inference-time compute to reason about their own uncertainty and knowledge boundaries, improving generalization not only to out-of-domain math problems but also to factual question answering tasks.
