

Clipping-Free Policy Optimization for Large Language Models

January 30, 2026
作者: Ömer Veysel Çağatan, Barış Akgün, Gözde Gül Şahin, Xuandong Zhao
cs.AI

Abstract

Reinforcement learning has become central to post-training large language models, yet dominant algorithms rely on clipping mechanisms that introduce optimization issues at scale, including zero-gradient regions, reward hacking, and training instability. We propose Clipping-Free Policy Optimization (CFPO), which replaces heuristic clipping with a convex quadratic penalty derived from Total Variation divergence constraints, yielding an everywhere-differentiable objective that enforces stable policy updates without hard boundaries. We evaluate CFPO across both reasoning and alignment settings. In reasoning, CFPO matches clipping-based methods on downstream benchmarks while extending the stable training regime. In alignment, CFPO mitigates verbosity exploitation and reduces capability degradation, while achieving competitive instruction-following performance. CFPO requires only a one-line code change and no additional hyperparameters. Our results suggest that CFPO is a promising drop-in alternative to clipping-based methods for LLM post-training.
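The abstract describes replacing the clipped surrogate with a convex quadratic penalty derived from a Total Variation divergence constraint, requiring only a one-line change. The sketch below illustrates what such a swap could look like in a PPO/GRPO-style token-level loss; the function names and the specific penalty scaling ((ratio - 1)^2 / (2*eps)) are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def clipped_surrogate(log_ratio, advantages, eps=0.2):
    """PPO/GRPO-style clipped objective (baseline for comparison).
    Gradients vanish for tokens whose ratio leaves the [1 - eps, 1 + eps] band."""
    ratio = torch.exp(log_ratio)                       # pi_theta / pi_old per token
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return torch.min(ratio * advantages, clipped * advantages).mean()

def quadratic_penalty_surrogate(log_ratio, advantages, eps=0.2):
    """Illustrative clipping-free variant: the hard clip is replaced by a convex
    quadratic penalty on the ratio's deviation from 1, keeping the objective
    differentiable everywhere. The penalty scale 1 / (2 * eps) is an assumption;
    the paper derives its penalty from a Total Variation divergence constraint."""
    ratio = torch.exp(log_ratio)
    penalty = (ratio - 1.0).pow(2) / (2.0 * eps)
    return (ratio * advantages - penalty).mean()
```

Both functions take per-token log probability ratios and advantages; switching from one to the other amounts to replacing the clipped-min line with the penalty line, consistent with the abstract's claim of a one-line code change.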