
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

July 1, 2025
Authors: Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue
cs.AI

Abstract

Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability, or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. Surprisingly, we find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.
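
The abstract names two drift diagnostics: a token-space distribution-shift analysis and a latent-space representation analysis. Below is a minimal sketch of what such measurements could look like, assuming Hugging Face transformers and that the tuned checkpoint is fine-tuned from the same base model; the checkpoint path `path/to/math-sft-model` is hypothetical, and the specific metrics (mean per-position KL divergence, final-layer cosine drift) are illustrative assumptions, not the paper's exact method or code.

```python
# Minimal sketch, assuming the tuned checkpoint is a math-SFT fine-tune of the
# same base model; the checkpoint path, KL direction, and cosine-drift proxy
# are illustrative assumptions, not the paper's released metrics or code.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen3-14B"             # base model from the paper's controlled setup
TUNED = "path/to/math-sft-model"    # hypothetical math-only SFT checkpoint

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
tuned = AutoModelForCausalLM.from_pretrained(TUNED, torch_dtype=torch.bfloat16, device_map="auto")

@torch.no_grad()
def token_space_kl(prompt: str) -> float:
    """Mean KL(base || tuned) over the next-token distribution at each position."""
    ids = tok(prompt, return_tensors="pt").input_ids
    logp_b = F.log_softmax(base(ids.to(base.device)).logits.float(), dim=-1)
    logp_t = F.log_softmax(tuned(ids.to(tuned.device)).logits.float(), dim=-1).to(logp_b.device)
    # KL(P || Q) = sum_x P(x) * (log P(x) - log Q(x)), averaged over positions
    return (logp_b.exp() * (logp_b - logp_t)).sum(-1).mean().item()

@torch.no_grad()
def latent_drift(prompt: str) -> float:
    """1 - cosine similarity between final-layer hidden states, averaged over
    positions; a simple representation-drift proxy (models share a base)."""
    ids = tok(prompt, return_tensors="pt").input_ids
    h_b = base(ids.to(base.device), output_hidden_states=True).hidden_states[-1].float()
    h_t = tuned(ids.to(tuned.device), output_hidden_states=True).hidden_states[-1].float()
    return (1.0 - F.cosine_similarity(h_b, h_t.to(h_b.device), dim=-1)).mean().item()

# Larger values on general-domain (non-math) prompts indicate stronger output
# and representation drift away from the base model -- the pattern the abstract
# attributes to SFT, and which RL-tuned models largely avoid.
prompt = "Summarize the plot of Hamlet in two sentences."
print(f"token-space KL: {token_space_kl(prompt):.4f}")
print(f"latent drift:   {latent_drift(prompt):.4f}")
```

Comparing these scores on general-domain prompts for an SFT-tuned versus an RL-tuned checkpoint of the same base would give a rough replication of the drift contrast the abstract describes.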