ChatPaper.ai


Uncovering Cross-Objective Interference in Multi-Objective Alignment

February 6, 2026
作者: Yining Lu, Meng Jiang
cs.AI

Abstract

We study a persistent failure mode in multi-objective alignment for large language models (LLMs): training improves performance on only a subset of objectives while causing others to degrade. We formalize this phenomenon as cross-objective interference and conduct the first systematic study across classic scalarization algorithms, showing that interference is pervasive and exhibits strong model dependence. To explain this phenomenon, we derive a local covariance law showing that an objective improves at first order when its reward exhibits positive covariance with the scalarized score. We extend this analysis to the clipped surrogate objectives used in modern alignment, demonstrating that the covariance law remains valid under mild conditions despite clipping. Building on this analysis, we propose Covariance Targeted Weight Adaptation (CTWA), a plug-and-play method that maintains positive covariance between objective rewards and the training signal to effectively mitigate cross-objective interference. Finally, we complement these local improvement conditions with a global convergence analysis under the Polyak-Łojasiewicz condition, establishing when non-convex scalarized optimization achieves global convergence and how cross-objective interference depends on specific model geometric properties.
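The abstract does not spell out CTWA's update rule, but the covariance law suggests a natural form: estimate the covariance between each objective's reward and the scalarized training signal from sampled rollouts, and upweight objectives whose covariance has turned negative (the ones that would degrade at first order). Below is a minimal sketch under that assumption; the function name, learning rate, and simplex projection are illustrative choices, not the paper's implementation.

```python
import numpy as np

def covariance_guided_weights(rewards, weights, lr=0.1):
    """Hypothetical sketch of covariance-targeted weight adaptation.

    rewards : (n_samples, n_objectives) per-sample objective rewards
    weights : (n_objectives,) current scalarization weights on the simplex
    Returns updated weights nudged so each objective's reward keeps
    positive covariance with the scalarized training signal.
    """
    scalarized = rewards @ weights                       # per-sample scalar score
    centered_r = rewards - rewards.mean(axis=0)          # center each objective
    centered_s = scalarized - scalarized.mean()          # center the scalar signal
    cov = centered_r.T @ centered_s / len(scalarized)    # Cov(R_k, S) per objective
    # Upweight objectives whose covariance with the signal is negative:
    # by the local covariance law, these are the ones degrading at first order.
    new_w = weights + lr * np.maximum(0.0, -cov)
    return new_w / new_w.sum()                           # renormalize to the simplex
```

For example, with two perfectly anticorrelated objectives and weights skewed toward the first, the second objective's covariance with the scalarized score is negative, so its weight is increased after one update.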