Latent Adversarial Regularization for Offline Preference Optimization

Abstract

Learning from human feedback typically relies on preference optimization that constrains policy updates through token-level regularization. However, preference optimization for language models is particularly challenging because token-space similarity does not imply semantic or behavioral similarity. To address this challenge, we leverage latent-space regularization for language model preference optimization. We introduce GANPO, which achieves latent-space regularization by penalizing divergence between the internal representations of a policy model and a reference model. Because latent representations are not associated with explicit probability densities, we adopt an adversarial approach inspired by GANs to minimize latent-space divergence. We integrate GANPO as a regularizer into existing offline preference optimization objectives. Experiments across multiple model architectures and tasks show consistent improvements from latent-space regularization. Further, by comparing GANPO-induced inferential biases with those from token-level regularization, we find that GANPO provides more robust structural feedback under distributional shift and noise while maintaining comparable downstream performance with minor computational overhead.
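
The abstract describes the regularizer only at a high level, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes a DPO-style base loss as the "existing offline preference optimization objective", a linear logistic discriminator over a single hidden-state vector, and a non-saturating GAN generator loss as the latent penalty. The names `dpo_loss`, `disc_logit`, `latent_penalty`, `regularized_loss`, and the weight `lam` are hypothetical and do not come from the source.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO-style loss on sequence log-probabilities (assumed base objective)."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def disc_logit(h, w, b):
    """Hypothetical linear discriminator: logit that latent h is from the reference."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b


def discriminator_loss(h_ref, h_pol, w, b):
    """Discriminator step: label reference latents 1 and policy latents 0."""
    return -log_sigmoid(disc_logit(h_ref, w, b)) - log_sigmoid(-disc_logit(h_pol, w, b))


def latent_penalty(h_pol, w, b):
    """Policy step (non-saturating GAN loss): small when the discriminator
    mistakes the policy latent for a reference latent."""
    return -log_sigmoid(disc_logit(h_pol, w, b))


def regularized_loss(logps, h_pol, w, b, beta=0.1, lam=0.5):
    """Base preference loss plus lam times the adversarial latent penalty."""
    return dpo_loss(*logps, beta=beta) + lam * latent_penalty(h_pol, w, b)


# Toy usage: four log-probabilities and a 2-dimensional policy latent.
loss = regularized_loss((-1.0, -2.0, -1.5, -1.5), [0.5, -0.5], [1.0, 1.0], 0.0)
```

In this sketch the discriminator and the policy would be updated in alternation, as in standard GAN training. Which layer supplies the latent, how latents are pooled over tokens, and which divergence the discriminator approximates are all left open by the abstract and are assumptions here.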
