

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

July 1, 2025
Authors: Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue
cs.AI

Abstract

Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.
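As an illustration of the kind of token-space distribution shift and latent-space representation analysis the abstract describes, here is a minimal sketch (not the authors' code) that compares a base checkpoint with a fine-tuned one on a non-math probe prompt. It assumes Hugging Face `transformers` and PyTorch; the model identifiers (`Qwen/Qwen3-14B`, `path/to/sft-tuned-model`) and the probe prompt are placeholders to be swapped for the checkpoints under study.

```python
# Sketch: probe token-space distribution shift (per-token KL divergence) and
# latent-space drift (hidden-state cosine similarity) between two checkpoints.
# Requires: pip install torch transformers accelerate
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "Qwen/Qwen3-14B"            # assumed base checkpoint
TUNED_ID = "path/to/sft-tuned-model"  # hypothetical fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(
    BASE_ID, torch_dtype=torch.bfloat16, device_map="auto", output_hidden_states=True)
tuned = AutoModelForCausalLM.from_pretrained(
    TUNED_ID, torch_dtype=torch.bfloat16, device_map="auto", output_hidden_states=True)

# A general-domain (non-math) probe, since the question is transfer beyond math.
prompt = "Write a short email politely declining a meeting."
inputs = tokenizer(prompt, return_tensors="pt").to(base.device)

with torch.no_grad():
    out_base = base(**inputs)
    out_tuned = tuned(**inputs)

# Token-space shift: mean per-position KL(P_tuned || P_base) over the vocabulary.
logp_base = F.log_softmax(out_base.logits.float(), dim=-1)
logp_tuned = F.log_softmax(out_tuned.logits.float(), dim=-1)
kl = F.kl_div(logp_base, logp_tuned, log_target=True, reduction="none").sum(-1)
print(f"mean per-token KL(tuned || base): {kl.mean().item():.4f}")

# Latent-space drift: cosine similarity of last-layer hidden states, per position.
h_base = out_base.hidden_states[-1].float()
h_tuned = out_tuned.hidden_states[-1].float()
cos = F.cosine_similarity(h_base, h_tuned, dim=-1)
print(f"mean last-layer cosine similarity: {cos.mean().item():.4f}")
```

Under the paper's framing, an SFT-distilled model would be expected to show larger KL and lower hidden-state similarity on such general-domain prompts than an RL-tuned model trained on the same math-only data.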