The Hallucination Tax of Reinforcement Finetuning
May 20, 2025
Authors: Linxin Song, Taiwei Shi, Jieyu Zhao
cs.AI
Abstract
Reinforcement finetuning (RFT) has become a standard approach for enhancing
the reasoning capabilities of large language models (LLMs). However, its impact
on model trustworthiness remains underexplored. In this work, we identify and
systematically study a critical side effect of RFT, which we term the
hallucination tax: a degradation in refusal behavior that causes models to
confidently produce hallucinated answers to unanswerable questions. To investigate
this, we introduce SUM (Synthetic Unanswerable Math), a high-quality dataset of
unanswerable math problems designed to probe models' ability to recognize
unanswerable questions by reasoning from insufficient or ambiguous
information. Our results show that standard RFT training can reduce model
refusal rates by more than 80%, significantly increasing the model's tendency
to hallucinate. We further demonstrate that incorporating just 10% SUM data
during RFT substantially restores appropriate refusal behavior, with minimal accuracy
trade-offs on solvable tasks. Crucially, this approach enables LLMs to leverage
inference-time compute to reason about their own uncertainty and knowledge
boundaries, improving generalization not only to out-of-domain math problems
but also to factual question answering tasks.
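To make the data-mixing idea concrete, below is a minimal sketch of how unanswerable SUM examples might be blended into an RFT training pool at the 10% ratio described in the abstract, together with a refusal-aware reward. The record format, the is_answerable flag, the refusal-phrase matching, the reward values, and the function names (mix_training_pool, refusal_aware_reward) are illustrative assumptions, not the authors' implementation.

```python
import random

# Hypothetical example records; the real SUM dataset format may differ.
# Each item carries the problem text, a gold answer (None if unanswerable),
# and a flag marking whether the problem is solvable.
solvable_pool = [
    {"problem": "Compute 2 + 2.", "answer": "4", "is_answerable": True},
]
sum_pool = [
    {"problem": "Ann bought some apples and ate a few. How many are left?",
     "answer": None, "is_answerable": False},
]

def mix_training_pool(solvable, unanswerable, sum_ratio=0.10, size=1000, seed=0):
    """Blend solvable RFT data with unanswerable SUM items at a fixed ratio."""
    rng = random.Random(seed)
    n_sum = int(size * sum_ratio)            # 10% unanswerable by default
    n_solvable = size - n_sum
    pool = (rng.choices(solvable, k=n_solvable)
            + rng.choices(unanswerable, k=n_sum))
    rng.shuffle(pool)
    return pool

def refusal_aware_reward(example, model_output):
    """Assumed reward: +1 for a correct answer on solvable problems,
    +1 for an explicit refusal on unanswerable ones, 0 otherwise."""
    refused = "i don't know" in model_output.lower()
    if example["is_answerable"]:
        return 1.0 if (not refused and example["answer"] in model_output) else 0.0
    return 1.0 if refused else 0.0
```

Under these assumptions, mix_training_pool(solvable_pool, sum_pool) yields a shuffled pool in which roughly one in ten prompts is unanswerable, so the policy is rewarded both for solving solvable problems and for refusing when the given information is insufficient.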