Surgical Post-Training: Cutting Errors, Keeping Knowledge
March 2, 2026
Authors: Wenye Lin, Kai Han
cs.AI
Abstract
Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by a trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover, and validate both theoretically and empirically, an overlooked yet critical mechanism: the implicit regularization inherent in the reward estimate of Direct Preference Optimization (DPO). This motivates our Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving previously learned knowledge. SPoT consists of: (1) a data rectification pipeline that employs an Oracle to surgically correct erroneous reasoning steps via minimal edits, generating data proximal to the model's distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking used in DPO, this objective treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals. Empirically, with only 4k rectified math data pairs and merely 28 minutes of training on 8x H800 GPUs, SPoT improves Qwen3-8B's average accuracy by 6.2% across in-domain and out-of-distribution (OOD) tasks. Code: https://github.com/Visual-AI/SPoT
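To make the contrast with DPO concrete, below is a minimal PyTorch sketch of what a reward-based binary cross-entropy objective of this kind might look like, assuming the standard DPO implicit reward r = β · log(π_θ(y|x) / π_ref(y|x)). This is a reconstruction from the abstract alone: the function names, the β default, and the label convention are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def spot_bce_loss(policy_logps: torch.Tensor,
                  ref_logps: torch.Tensor,
                  labels: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """Reward-based BCE objective (sketch, reconstructed from the abstract).

    Each response is supervised independently: rectified/correct traces
    get label 1.0, erroneous traces label 0.0.

    policy_logps: (B,) summed log-probs of each response under the policy
    ref_logps:    (B,) summed log-probs under the frozen reference model
    labels:       (B,) binary correctness labels
    beta:         reward scale (hypothetical default)
    """
    # Implicit DPO-style reward estimate: beta * log(pi_theta / pi_ref).
    rewards = beta * (policy_logps - ref_logps)
    # Decoupled supervision: binary classification of each trace's
    # correctness, with no pairing between chosen and rejected samples.
    return F.binary_cross_entropy_with_logits(rewards, labels)

def dpo_loss(chosen_rewards: torch.Tensor,
             rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Standard DPO pairwise objective, shown for comparison: only the
    reward *difference* within each pair is supervised."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The key design difference, as the abstract describes it, is that the BCE formulation attaches an absolute supervision signal to each trace rather than a relative margin between two traces, which is what allows correct and erroneous reasoning steps to be trained on without constructing explicit preference pairs.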