Surgical Post-Training: Cutting Errors, Keeping Knowledge
March 2, 2026
Authors: Wenye Lin, Kai Han
cs.AI
Abstract
Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by a trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover, and validate both theoretically and empirically, an overlooked yet critical mechanism: the implicit regularization inherent in the reward estimate of Direct Preference Optimization (DPO). This motivates our Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving previously learned knowledge. SPoT consists of: (1) a data rectification pipeline that employs an Oracle to surgically correct erroneous reasoning steps via minimal edits, generating data proximal to the model's distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking used in DPO, this objective treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals. Empirically, with only 4k rectified math data pairs and merely 28 minutes of training on 8x H800 GPUs, SPoT improves Qwen3-8B's average accuracy by 6.2% across in-domain and out-of-distribution (OOD) tasks. Code: https://github.com/Visual-AI/SPoT
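To make the contrast with DPO concrete, below is a minimal PyTorch sketch of what a reward-based binary cross-entropy objective of this kind might look like, assuming the standard DPO implicit reward r = β · log(π_θ(y|x) / π_ref(y|x)). This is a reconstruction from the abstract alone: the function names, the β default, and the label convention are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def spot_bce_loss(policy_logps: torch.Tensor,
                  ref_logps: torch.Tensor,
                  labels: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """Reward-based BCE objective (sketch, reconstructed from the abstract).

    Each response is supervised independently: rectified/correct traces
    get label 1.0, erroneous traces label 0.0.

    policy_logps: (B,) summed log-probs of each response under the policy
    ref_logps:    (B,) summed log-probs under the frozen reference model
    labels:       (B,) binary correctness labels
    beta:         reward scale (hypothetical default)
    """
    # Implicit DPO-style reward estimate: beta * log(pi_theta / pi_ref).
    rewards = beta * (policy_logps - ref_logps)
    # Decoupled supervision: binary classification of each trace's
    # correctness, with no pairing between chosen and rejected samples.
    return F.binary_cross_entropy_with_logits(rewards, labels)

def dpo_loss(chosen_rewards: torch.Tensor,
             rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Standard DPO pairwise objective, shown for comparison: only the
    reward *difference* within each pair is supervised."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The key design difference, as the abstract describes it, is that the BCE formulation attaches an absolute supervision signal to each trace rather than a relative margin between two traces, which is what allows correct and erroneous reasoning steps to be trained on without constructing explicit preference pairs.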