
Latent Adversarial Regularization for Offline Preference Optimization

January 29, 2026
Authors: Enyi Jiang, Yibo Jacky Zhang, Yinglun Xu, Andreas Haupt, Nancy Amato, Sanmi Koyejo
cs.AI

Abstract

Learning from human feedback typically relies on preference optimization that constrains policy updates through token-level regularization. However, preference optimization for language models is particularly challenging because token-space similarity does not imply semantic or behavioral similarity. To address this challenge, we leverage latent-space regularization for language model preference optimization. We introduce GANPO, which achieves latent-space regularization by penalizing divergence between the internal representations of a policy model and a reference model. Given that latent representations are not associated with explicit probability densities, we adopt an adversarial approach inspired by GANs to minimize latent-space divergence. We integrate GANPO as a regularizer into existing offline preference optimization objectives. Experiments across multiple model architectures and tasks show consistent improvements from latent-space regularization. Further, by comparing GANPO-induced inferential biases with those from token-level regularization, we find that GANPO provides more robust structural feedback under distributional shift and noise while maintaining comparable downstream performance with minor computational overhead.
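The abstract describes GANPO as a GAN-style latent-space regularizer added to an offline preference objective such as DPO: a discriminator tries to tell the policy's internal representations apart from the reference model's, and the policy is penalized when they are separable. The sketch below illustrates one plausible reading of that setup, not the paper's implementation; the module names, the choice of hidden layer, the loss weighting `lambda_latent`, and the training schedule are all illustrative assumptions.

```python
# Minimal sketch of adversarial latent-space regularization in the spirit of GANPO,
# layered on a DPO-style offline preference objective. All names, the choice of
# hidden layer, and the loss weighting are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentDiscriminator(nn.Module):
    """Scores whether a hidden state comes from the reference model (1) or the policy (0)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [batch, seq, hidden] -> per-token logits [batch, seq]
        return self.net(h).squeeze(-1)


def dpo_loss(pol_chosen_lp, pol_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO objective on sequence log-probabilities."""
    margin = (pol_chosen_lp - ref_chosen_lp) - (pol_rejected_lp - ref_rejected_lp)
    return -F.logsigmoid(beta * margin).mean()


def ganpo_losses(policy_hidden, ref_hidden, disc, dpo_args, lambda_latent=0.1):
    """Combine the preference loss with an adversarial latent penalty.

    policy_hidden / ref_hidden: [batch, seq, hidden] hidden states produced by the
    policy and the frozen reference model on the same token sequences.
    Returns (policy_loss, discriminator_loss) for alternating updates.
    """
    # Discriminator update: separate reference latents (label 1) from policy latents (label 0).
    d_ref = disc(ref_hidden.detach())
    d_pol = disc(policy_hidden.detach())
    disc_loss = (
        F.binary_cross_entropy_with_logits(d_ref, torch.ones_like(d_ref))
        + F.binary_cross_entropy_with_logits(d_pol, torch.zeros_like(d_pol))
    )

    # Policy update: the adversarial penalty is small when the policy's latents are
    # indistinguishable from the reference's, i.e. when they "fool" the discriminator.
    d_pol_live = disc(policy_hidden)  # gradients flow into the policy here
    adv_penalty = F.binary_cross_entropy_with_logits(
        d_pol_live, torch.ones_like(d_pol_live)
    )

    policy_loss = dpo_loss(*dpo_args) + lambda_latent * adv_penalty
    return policy_loss, disc_loss
```

In practice the discriminator and policy losses would be optimized in alternating steps, as in standard GAN training; the discriminator sees detached hidden states so that only the adversarial penalty term propagates gradients into the policy.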