Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
July 1, 2025
Authors: Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue
cs.AI
Abstract
Math reasoning has become the poster child of progress in large language
models (LLMs), with new models rapidly surpassing human-level performance on
benchmarks like MATH and AIME. But as math leaderboards improve week by week,
it is worth asking: do these gains reflect broader problem-solving ability or
just narrow overfitting? To answer this question, we evaluate over 20
open-weight reasoning-tuned models across a broad suite of tasks, including
math, scientific QA, agent planning, coding, and standard
instruction-following. We surprisingly find that most models that succeed in
math fail to transfer their gains to other domains. To rigorously study this
phenomenon, we conduct controlled experiments on Qwen3-14B models using
math-only data but different tuning methods. We find that reinforcement
learning (RL)-tuned models generalize well across domains, while supervised
fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space
representation and token-space distribution shift analyses reveal that SFT
induces substantial representation and output drift, while RL preserves
general-domain structure. Our results suggest a need to rethink standard
post-training recipes, particularly the reliance on SFT-distilled data for
advancing reasoning models.
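The abstract refers to a token-space distribution shift analysis without spelling out the procedure. As a rough illustration only (not the paper's exact method), the sketch below compares the next-token distributions of a base model and a fine-tuned variant on a general-domain prompt using per-token KL divergence; the fine-tuned checkpoint name is a hypothetical placeholder.

```python
# Minimal sketch of a token-space distribution-shift measurement:
# per-token KL divergence between a fine-tuned model and its base model
# on general-domain text. Checkpoint names are placeholders, not the
# paper's released artifacts.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "Qwen/Qwen3-14B"            # base checkpoint
tuned_name = "your-org/qwen3-14b-sft"   # hypothetical fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)
tuned = AutoModelForCausalLM.from_pretrained(tuned_name, torch_dtype=torch.bfloat16)
base.eval()
tuned.eval()

# A non-math, general-domain prompt to probe drift outside the tuning domain.
text = "Write a short email apologizing for a delayed shipment."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logp_base = F.log_softmax(base(**inputs).logits, dim=-1)
    logp_tuned = F.log_softmax(tuned(**inputs).logits, dim=-1)

# KL(tuned || base) at each token position, averaged over the sequence.
kl_per_token = (logp_tuned.exp() * (logp_tuned - logp_base)).sum(dim=-1)
print(f"Mean token-level KL shift: {kl_per_token.mean().item():.4f}")
```

Under the abstract's framing, one would expect this shift to be larger for an SFT-tuned checkpoint than for an RL-tuned one on general-domain prompts, though the paper's actual analysis may use different prompts, statistics, or aggregation.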