Surgical Post-Training: Cutting Errors, Keeping Knowledge

Abstract

Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by the trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover, and validate both theoretically and empirically, an overlooked yet critical mechanism: the implicit regularization inherent in the reward estimate of Direct Preference Optimization (DPO). This motivates Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving learned prior knowledge. SPoT consists of: (1) a data rectification pipeline that employs an Oracle to surgically correct erroneous reasoning steps via minimal edits, generating data proximal to the model's distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking in DPO, this objective treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals. Empirically, with only 4k rectified math data pairs, SPoT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and out-of-domain (OOD) tasks, requiring only about 28 minutes of training on 8x H800 GPUs. Code: https://github.com/Visual-AI/SPoT
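
The contrast between DPO's relative ranking and a binary-classification objective can be illustrated with a minimal sketch. It assumes the standard DPO implicit reward, beta * (log pi_theta(y|x) - log pi_ref(y|x)), and one plausible binary cross-entropy form in which the correct response is labeled 1 and the incorrect response 0. The function names, the beta value, and the exact form of the BCE loss are illustrative assumptions; the abstract does not give the objective, so this is not necessarily the loss defined in the paper or the SPoT repository.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO-style implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x)).

    beta = 0.1 is an illustrative default, not a value from the source.
    """
    return beta * (logp_policy - logp_ref)


def dpo_loss(r_chosen: float, r_rejected: float) -> float:
    """Standard DPO loss: depends only on the reward margin (relative ranking)."""
    return -log_sigmoid(r_chosen - r_rejected)


def reward_bce_loss(r_correct: float, r_incorrect: float) -> float:
    """Assumed BCE variant: each response is classified on its own reward.

    The correct response is pushed toward sigmoid(r) = 1 and the incorrect
    one toward sigmoid(r) = 0, so the two terms are decoupled.
    """
    return -log_sigmoid(r_correct) - log_sigmoid(-r_incorrect)


if __name__ == "__main__":
    # Two reward pairs with the same margin (2.0) but different absolute levels.
    pair_a = (1.0, -1.0)
    pair_b = (-2.0, -4.0)
    print("DPO:", dpo_loss(*pair_a), dpo_loss(*pair_b))
    print("BCE:", reward_bce_loss(*pair_a), reward_bce_loss(*pair_b))
```

In this sketch the DPO loss is identical for both pairs because only the margin enters it, while the BCE-style loss penalizes the second pair, where the correct response itself has a negative implicit reward. That is one way to read "decoupled supervision signals" in the abstract.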
