Latent Adversarial Regularization for Offline Preference Optimization
January 29, 2026
Authors: Enyi Jiang, Yibo Jacky Zhang, Yinglun Xu, Andreas Haupt, Nancy Amato, Sanmi Koyejo
cs.AI
Abstract
Learning from human feedback typically relies on preference optimization that constrains policy updates through token-level regularization. However, preference optimization for language models is particularly challenging because token-space similarity does not imply semantic or behavioral similarity. To address this challenge, we leverage latent-space regularization for language model preference optimization. We introduce GANPO, which achieves latent-space regularization by penalizing divergence between the internal representations of a policy model and a reference model. Given that latent representations are not associated with explicit probability densities, we adopt an adversarial approach inspired by GANs to minimize latent-space divergence. We integrate GANPO as a regularizer into existing offline preference optimization objectives. Experiments across multiple model architectures and tasks show consistent improvements from latent-space regularization. Further, by comparing GANPO-induced inferential biases with those from token-level regularization, we find that GANPO provides more robust structural feedback under distributional shift and noise while maintaining comparable downstream performance with minor computational overhead.
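Since the abstract describes the mechanism only at a high level, the following is a minimal PyTorch sketch of how an adversarial latent regularizer of this kind might be combined with a DPO-style preference objective. Everything concrete here is an assumption for illustration: the discriminator architecture, the non-saturating BCE losses, the 0.1 regularization weight, and the names LatentDiscriminator, dpo_loss, and adversarial_latent_penalty are hypothetical and do not come from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentDiscriminator(nn.Module):
    """Tries to tell reference-model hidden states apart from policy hidden states."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h).squeeze(-1)  # per-token logits, shape (batch, seq)


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta: float = 0.1):
    """Standard DPO objective on per-sequence log-probabilities."""
    margins = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(margins).mean()


def adversarial_latent_penalty(disc, policy_hidden, ref_hidden):
    """GAN-style latent regularizer (assumed form): the discriminator labels
    reference states as real and policy states as fake; the policy is penalized
    for being distinguishable (non-saturating generator loss)."""
    real_logits = disc(ref_hidden)
    fake_logits = disc(policy_hidden.detach())  # detach: discriminator update only
    d_loss = (
        F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
        + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    )
    gen_logits = disc(policy_hidden)  # gradients flow back to the policy here
    policy_penalty = F.binary_cross_entropy_with_logits(
        gen_logits, torch.ones_like(gen_logits)
    )
    return d_loss, policy_penalty


if __name__ == "__main__":
    # Toy tensors stand in for real model outputs (hidden states and log-probs).
    batch, seq, dim = 4, 16, 32
    disc = LatentDiscriminator(dim)
    policy_h = torch.randn(batch, seq, dim, requires_grad=True)
    ref_h = torch.randn(batch, seq, dim)
    pi_c, pi_r, ref_c, ref_r = (torch.randn(batch) for _ in range(4))

    d_loss, penalty = adversarial_latent_penalty(disc, policy_h, ref_h)
    policy_loss = dpo_loss(pi_c, pi_r, ref_c, ref_r) + 0.1 * penalty  # 0.1 is an assumed weight
    print(f"discriminator loss: {d_loss.item():.3f}  policy loss: {policy_loss.item():.3f}")

In practice the discriminator and the policy would be updated in alternating steps, as in standard GAN training; this sketch only shows how the two loss terms are formed and combined.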