Uncovering Cross-Objective Interference in Multi-Objective Alignment

Abstract

We study a persistent failure mode in multi-objective alignment for large language models (LLMs): training improves performance on only a subset of objectives while degrading the others. We formalize this phenomenon as **cross-objective interference** and conduct the first systematic study of it across classic scalarization algorithms, showing that interference is pervasive and strongly model-dependent. To explain the phenomenon, we derive a **local covariance law**: an objective improves at first order when its reward has positive covariance with the scalarized score. We extend this analysis to the clipped surrogate objectives used in modern alignment and show that the covariance law remains valid under mild conditions despite clipping. Building on this analysis, we propose **Covariance Targeted Weight Adaptation (CTWA)**, a plug-and-play method that maintains positive covariance between objective rewards and the training signal and thereby mitigates cross-objective interference. Finally, we complement these local improvement conditions with a global convergence analysis under the Polyak-Łojasiewicz condition, establishing when non-convex scalarized optimization achieves global convergence and how cross-objective interference depends on specific geometric properties of the model.
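The covariance law can be illustrated numerically on a toy problem. The sketch below is not the paper's setup: it assumes a four-response categorical policy, two hypothetical objectives (`helpfulness` and `brevity`), fixed scalarization weights, and an exponentially tilted update pi'(a) ∝ pi(a) exp(eta * s(a)), for which the first-order change in an objective's expected reward equals eta times its covariance with the scalarized score s. All names and numbers are illustrative assumptions.

```python
import math


def expectation(probs, values):
    """Expected value of `values` under the distribution `probs`."""
    return sum(p * v for p, v in zip(probs, values))


def covariance(probs, x, y):
    """Covariance of x and y under the distribution `probs`."""
    mx, my = expectation(probs, x), expectation(probs, y)
    return sum(p * (a - mx) * (b - my) for p, a, b in zip(probs, x, y))


def tilt(probs, score, eta):
    """Assumed update rule: reweight each response by exp(eta * score)."""
    unnorm = [p * math.exp(eta * s) for p, s in zip(probs, score)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Hypothetical toy policy over four candidate responses.
probs = [0.4, 0.3, 0.2, 0.1]

# Hypothetical per-response rewards for two competing objectives.
rewards = {
    "helpfulness": [1.0, 0.8, 0.2, 0.0],
    "brevity": [0.0, 0.1, 0.3, 1.0],
}

# Fixed scalarization weights (an assumption, not tuned values).
weights = {"helpfulness": 0.8, "brevity": 0.2}
score = [
    sum(weights[k] * rewards[k][i] for k in rewards)
    for i in range(len(probs))
]

eta = 1e-3  # small step so the first-order approximation is accurate
new_probs = tilt(probs, score, eta)

report = {}
for name, r in rewards.items():
    cov = covariance(probs, r, score)
    delta = expectation(new_probs, r) - expectation(probs, r)
    report[name] = (cov, delta)
    print(f"{name}: cov={cov:+.5f}  eta*cov={eta * cov:+.3e}  delta={delta:+.3e}")
```

In this toy instance the objective whose reward covaries negatively with the scalarized score loses expected reward even though the scalarized score itself increases, which is the interference pattern the abstract describes. A weight-adaptation scheme such as CTWA would aim to restore positive covariance; its actual update rule is not specified in the abstract and is not reproduced here.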
