Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
July 1, 2025
Authors: Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue
cs.AI
Abstract
Math reasoning has become the poster child of progress in large language
models (LLMs), with new models rapidly surpassing human-level performance on
benchmarks like MATH and AIME. But as math leaderboards improve week by week,
it is worth asking: do these gains reflect broader problem-solving ability or
just narrow overfitting? To answer this question, we evaluate over 20
open-weight reasoning-tuned models across a broad suite of tasks, including
math, scientific QA, agent planning, coding, and standard
instruction-following. We surprisingly find that most models that succeed in
math fail to transfer their gains to other domains. To rigorously study this
phenomenon, we conduct controlled experiments on Qwen3-14B models using
math-only data but different tuning methods. We find that reinforcement
learning (RL)-tuned models generalize well across domains, while supervised
fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space
representation and token-space distribution shift analyses reveal that SFT
induces substantial representation and output drift, while RL preserves
general-domain structure. Our results suggest a need to rethink standard
post-training recipes, particularly the reliance on SFT-distilled data for
advancing reasoning models.
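The abstract refers to a token-space distribution shift analysis without spelling out the procedure. As a rough illustration only (not the paper's exact method), the sketch below compares the next-token distributions of a base model and a fine-tuned variant on a general-domain prompt using per-token KL divergence; the fine-tuned checkpoint name is a hypothetical placeholder.

```python
# Minimal sketch of a token-space distribution-shift measurement:
# per-token KL divergence between a fine-tuned model and its base model
# on general-domain text. Checkpoint names are placeholders, not the
# paper's released artifacts.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "Qwen/Qwen3-14B"            # base checkpoint
tuned_name = "your-org/qwen3-14b-sft"   # hypothetical fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)
tuned = AutoModelForCausalLM.from_pretrained(tuned_name, torch_dtype=torch.bfloat16)
base.eval()
tuned.eval()

# A non-math, general-domain prompt to probe drift outside the tuning domain.
text = "Write a short email apologizing for a delayed shipment."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logp_base = F.log_softmax(base(**inputs).logits, dim=-1)
    logp_tuned = F.log_softmax(tuned(**inputs).logits, dim=-1)

# KL(tuned || base) at each token position, averaged over the sequence.
kl_per_token = (logp_tuned.exp() * (logp_tuned - logp_base)).sum(dim=-1)
print(f"Mean token-level KL shift: {kl_per_token.mean().item():.4f}")
```

Under the abstract's framing, one would expect this shift to be larger for an SFT-tuned checkpoint than for an RL-tuned one on general-domain prompts, though the paper's actual analysis may use different prompts, statistics, or aggregation.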