The Hallucination Tax of Reinforcement Finetuning
May 20, 2025
Authors: Linxin Song, Taiwei Shi, Jieyu Zhao
cs.AI
Abstract
Reinforcement finetuning (RFT) has become a standard approach for enhancing
the reasoning capabilities of large language models (LLMs). However, its impact
on model trustworthiness remains underexplored. In this work, we identify and
systematically study a critical side effect of RFT, which we term the
hallucination tax: a degradation in refusal behavior that causes models to
confidently produce hallucinated answers to unanswerable questions. To investigate
this, we introduce SUM (Synthetic Unanswerable Math), a high-quality dataset of
unanswerable math problems designed to probe models' ability to recognize
unanswerable questions by reasoning from insufficient or ambiguous
information. Our results show that standard RFT training can reduce model
refusal rates by more than 80%, significantly increasing the model's tendency
to hallucinate. We further demonstrate that incorporating just 10% SUM data
during RFT substantially restores appropriate refusal behavior, with minimal accuracy
trade-offs on solvable tasks. Crucially, this approach enables LLMs to leverage
inference-time compute to reason about their own uncertainty and knowledge
boundaries, improving generalization not only to out-of-domain math problems
but also to factual question answering tasks.
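To make the data-mixing idea concrete, below is a minimal sketch of how unanswerable SUM examples might be blended into an RFT training pool at the 10% ratio described in the abstract, together with a refusal-aware reward. The record format, the is_answerable flag, the refusal-phrase matching, the reward values, and the function names (mix_training_pool, refusal_aware_reward) are illustrative assumptions, not the authors' implementation.

```python
import random

# Hypothetical example records; the real SUM dataset format may differ.
# Each item carries the problem text, a gold answer (None if unanswerable),
# and a flag marking whether the problem is solvable.
solvable_pool = [
    {"problem": "Compute 2 + 2.", "answer": "4", "is_answerable": True},
]
sum_pool = [
    {"problem": "Ann bought some apples and ate a few. How many are left?",
     "answer": None, "is_answerable": False},
]

def mix_training_pool(solvable, unanswerable, sum_ratio=0.10, size=1000, seed=0):
    """Blend solvable RFT data with unanswerable SUM items at a fixed ratio."""
    rng = random.Random(seed)
    n_sum = int(size * sum_ratio)            # 10% unanswerable by default
    n_solvable = size - n_sum
    pool = (rng.choices(solvable, k=n_solvable)
            + rng.choices(unanswerable, k=n_sum))
    rng.shuffle(pool)
    return pool

def refusal_aware_reward(example, model_output):
    """Assumed reward: +1 for a correct answer on solvable problems,
    +1 for an explicit refusal on unanswerable ones, 0 otherwise."""
    refused = "i don't know" in model_output.lower()
    if example["is_answerable"]:
        return 1.0 if (not refused and example["answer"] in model_output) else 0.0
    return 1.0 if refused else 0.0
```

Under these assumptions, mix_training_pool(solvable_pool, sum_pool) yields a shuffled pool in which roughly one in ten prompts is unanswerable, so the policy is rewarded both for solving solvable problems and for refusing when the given information is insufficient.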