
Clipping-Free Policy Optimization for Large Language Models

January 30, 2026
Authors: Ömer Veysel Çağatan, Barış Akgün, Gözde Gül Şahin, Xuandong Zhao
cs.AI

Abstract

Reinforcement learning has become central to post-training large language models, yet dominant algorithms rely on clipping mechanisms that introduce optimization issues at scale, including zero-gradient regions, reward hacking, and training instability. We propose Clipping-Free Policy Optimization (CFPO), which replaces heuristic clipping with a convex quadratic penalty derived from Total Variation divergence constraints, yielding an everywhere-differentiable objective that enforces stable policy updates without hard boundaries. We evaluate CFPO across both reasoning and alignment settings. In reasoning, CFPO matches clipping-based methods on downstream benchmarks while extending the stable training regime. In alignment, CFPO mitigates verbosity exploitation and reduces capability degradation, while achieving competitive instruction-following performance. CFPO requires only a one-line code change and no additional hyperparameters. Our results suggest that CFPO is a promising drop-in alternative to clipping-based methods for LLM post-training.
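The abstract contrasts PPO-style clipping, which produces zero-gradient regions once the probability ratio leaves the clip interval, with CFPO's everywhere-differentiable quadratic penalty. The exact penalty form is not given in the abstract, so the following is only an illustrative scalar sketch: a hypothetical quadratic penalty on the ratio's deviation from 1 (the coefficient `abs(adv) / (2 * eps)` is an assumption, not the paper's formula), shown next to the standard clipped surrogate for comparison.

```python
def ppo_clip_term(ratio, adv, eps=0.2):
    """Standard PPO clipped surrogate (per-token, to be maximized).

    Once ratio leaves [1 - eps, 1 + eps] and the clipped branch is
    active, the objective is constant in ratio: gradient is zero.
    """
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * adv, clipped * adv)

def cfpo_like_term(ratio, adv, eps=0.2):
    """Hypothetical clipping-free surrogate (illustrative only).

    Replaces the hard clip with a convex quadratic penalty on
    (ratio - 1), keeping the objective smooth everywhere. The
    penalty weight here is a guess, not CFPO's actual derivation.
    """
    return ratio * adv - (abs(adv) / (2.0 * eps)) * (ratio - 1.0) ** 2
```

A quick finite-difference check shows the qualitative difference the abstract describes: at `ratio = 1.5` with positive advantage, the clipped surrogate has zero slope (the update stalls), while the quadratic-penalty surrogate still provides a restoring gradient pulling the ratio back toward 1.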