Surgical Post-Training: Cutting Errors, Keeping Knowledge

Abstract

Post-training that strengthens the reasoning capabilities of Large Language Models (LLMs) is often constrained by a trade-off between efficiency and catastrophic forgetting. Prior research emphasizes the role of on-policy data in mitigating forgetting. We identify, and validate both theoretically and empirically, an overlooked yet critical mechanism: the implicit regularization inherent in the reward estimate of Direct Preference Optimization (DPO). This motivates Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving previously learned knowledge. SPoT consists of (1) a data rectification pipeline that uses an Oracle to surgically correct erroneous reasoning steps through minimal edits, yielding data close to the model's own distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking in DPO, this objective treats reasoning correctness as a binary classification problem and thereby enforces decoupled supervision signals. Empirically, with only 4k rectified math data pairs, SPoT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and OOD tasks, while requiring only 28 minutes of training on 8x H800 GPUs. Code: https://github.com/Visual-AI/SPoT
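The abstract does not state the exact form of the reward-based objective, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It uses the standard DPO implicit reward, beta * (log pi_theta(y|x) - log pi_ref(y|x)), and contrasts DPO's pairwise ranking loss with an assumed binary cross-entropy loss that classifies each response on its own. The function names, the default `beta`, and the BCE form itself are illustrative assumptions.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x)).

    Inputs are sequence log-probabilities; beta=0.1 is an arbitrary default.
    """
    return beta * (logp_policy - logp_ref)


def dpo_loss(r_chosen: float, r_rejected: float) -> float:
    """Standard DPO loss: depends only on the margin between the two rewards."""
    return -log_sigmoid(r_chosen - r_rejected)


def reward_bce_loss(r_correct: float, r_incorrect: float) -> float:
    """ASSUMED form of a reward-based BCE objective (not taken from the paper).

    The correct response is pushed toward label 1 and the incorrect one toward
    label 0 independently, so each reward is supervised on its own.
    """
    return -log_sigmoid(r_correct) - log_sigmoid(-r_incorrect)


if __name__ == "__main__":
    # Shifting both rewards by the same constant leaves the DPO loss unchanged,
    # but changes the BCE loss, because BCE anchors each reward around zero.
    print(dpo_loss(1.0, -1.0), dpo_loss(3.0, 1.0))
    print(reward_bce_loss(1.0, -1.0), reward_bce_loss(3.0, 1.0))
```

Under this reading, "decoupled" means that the loss on the incorrect response cannot be reduced merely by raising the reward of the correct one: each term has its own target, whereas the DPO loss is invariant to any shift applied to both rewards.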
