Uncovering Cross-Objective Interference in Multi-Objective Alignment

Abstract

We study a persistent failure mode in multi-objective alignment for large language models (LLMs): training improves performance on only a subset of objectives while causing the others to degrade. We formalize this phenomenon as **cross-objective interference** and conduct the first systematic study across classic scalarization algorithms, showing that interference is pervasive and strongly model-dependent. To explain this phenomenon, we derive a **local covariance law**: an objective improves at first order when its reward has positive covariance with the scalarized score. We extend this analysis to the clipped surrogate objectives used in modern alignment and show that the covariance law remains valid under mild conditions despite clipping. Building on this analysis, we propose **Covariance Targeted Weight Adaptation (CTWA)**, a plug-and-play method that maintains positive covariance between objective rewards and the training signal to mitigate cross-objective interference. Finally, we complement these local improvement conditions with a **global convergence analysis** under the Polyak–Łojasiewicz condition, establishing when non-convex scalarized optimization achieves global convergence and how cross-objective interference depends on the geometric properties of a specific model.
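The covariance law can be illustrated numerically on a toy problem. The sketch below is not the paper's setup: it assumes a tabular policy over four responses, an exponential-tilt update p'(y) ∝ p(y)·exp(η·s(y)) standing in for a small policy-improvement step, and made-up reward values, objective names, and weights. Under those assumptions the first-order change in each objective's expected reward equals η times the covariance between that reward and the scalarized score s, so an objective with a positive weight can still degrade when its covariance with s is negative.

```python
import math

def expectation(p, f):
    """Expected value of f under the discrete distribution p."""
    return sum(pi * fi for pi, fi in zip(p, f))

def covariance(p, f, g):
    """Covariance of f and g under the discrete distribution p."""
    mf, mg = expectation(p, f), expectation(p, g)
    return sum(pi * (fi - mf) * (gi - mg) for pi, fi, gi in zip(p, f, g))

def tilt(p, score, eta):
    """Exponential-tilt update: p'(y) proportional to p(y) * exp(eta * score(y)).

    Assumption: this stands in for one small policy-improvement step.
    """
    unnorm = [pi * math.exp(eta * si) for pi, si in zip(p, score)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hypothetical toy policy over four candidate responses.
policy = [0.4, 0.3, 0.2, 0.1]

# Hypothetical per-response rewards for two objectives.
rewards = {
    "helpful": [1.0, 0.8, 0.2, 0.0],
    "harmless": [0.0, 0.3, 0.9, 0.2],
}

# Hypothetical scalarization weights (both positive).
weights = {"helpful": 0.7, "harmless": 0.3}

# Scalarized score s(y) = sum_k w_k * r_k(y).
score = [
    sum(weights[k] * rewards[k][i] for k in rewards)
    for i in range(len(policy))
]

eta = 1e-3
new_policy = tilt(policy, score, eta)

report = {}
for name, r in rewards.items():
    predicted = eta * covariance(policy, r, score)  # first-order prediction
    actual = expectation(new_policy, r) - expectation(policy, r)
    report[name] = (predicted, actual)
    print(f"{name}: predicted {predicted:+.3e}, actual {actual:+.3e}")
```

With these illustrative numbers, the "helpful" reward has positive covariance with the score and improves, while the "harmless" reward has negative covariance and degrades even though its weight is positive. This is the kind of interference the abstract describes.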
