
Clipping-Free Policy Optimization for Large Language Models

January 30, 2026
Authors: Ömer Veysel Çağatan, Barış Akgün, Gözde Gül Şahin, Xuandong Zhao
cs.AI

Abstract

Reinforcement learning has become central to post-training large language models, yet dominant algorithms rely on clipping mechanisms that introduce optimization issues at scale, including zero-gradient regions, reward hacking, and training instability. We propose Clipping-Free Policy Optimization (CFPO), which replaces heuristic clipping with a convex quadratic penalty derived from Total Variation divergence constraints, yielding an everywhere-differentiable objective that enforces stable policy updates without hard boundaries. We evaluate CFPO across both reasoning and alignment settings. In reasoning, CFPO matches clipping-based methods on downstream benchmarks while extending the stable training regime. In alignment, CFPO mitigates verbosity exploitation and reduces capability degradation, while achieving competitive instruction-following performance. CFPO requires only a one-line code change and no additional hyperparameters. Our results suggest that CFPO is a promising drop-in alternative to clipping-based methods for LLM post-training.
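The abstract contrasts PPO-style clipping, which produces zero-gradient regions once the probability ratio leaves the clip interval, with CFPO's everywhere-differentiable quadratic penalty. The exact penalty form is not given in the abstract, so the following is only an illustrative scalar sketch: a hypothetical quadratic penalty on the ratio's deviation from 1 (the coefficient `abs(adv) / (2 * eps)` is an assumption, not the paper's formula), shown next to the standard clipped surrogate for comparison.

```python
def ppo_clip_term(ratio, adv, eps=0.2):
    """Standard PPO clipped surrogate (per-token, to be maximized).

    Once ratio leaves [1 - eps, 1 + eps] and the clipped branch is
    active, the objective is constant in ratio: gradient is zero.
    """
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * adv, clipped * adv)

def cfpo_like_term(ratio, adv, eps=0.2):
    """Hypothetical clipping-free surrogate (illustrative only).

    Replaces the hard clip with a convex quadratic penalty on
    (ratio - 1), keeping the objective smooth everywhere. The
    penalty weight here is a guess, not CFPO's actual derivation.
    """
    return ratio * adv - (abs(adv) / (2.0 * eps)) * (ratio - 1.0) ** 2
```

A quick finite-difference check shows the qualitative difference the abstract describes: at `ratio = 1.5` with positive advantage, the clipped surrogate has zero slope (the update stalls), while the quadratic-penalty surrogate still provides a restoring gradient pulling the ratio back toward 1.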