Accelerating Nash Learning from Human Feedback via Mirror Prox
May 26, 2025
Authors: Daniil Tiapkin, Daniele Calandriello, Denis Belomestny, Eric Moulines, Alexey Naumov, Kashif Rasul, Michal Valko, Pierre Menard
cs.AI
Abstract
Traditional Reinforcement Learning from Human Feedback (RLHF) often relies on
reward models, frequently assuming preference structures like the Bradley-Terry
model, which may not accurately capture the complexities of real human
preferences (e.g., intransitivity). Nash Learning from Human Feedback (NLHF)
offers a more direct alternative by framing the problem as finding a Nash
equilibrium of a game defined by these preferences. In this work, we introduce
Nash Mirror Prox (Nash-MP), an online NLHF algorithm that leverages
the Mirror Prox optimization scheme to achieve fast and stable convergence to
the Nash equilibrium. Our theoretical analysis establishes that Nash-MP
exhibits last-iterate linear convergence towards the β-regularized Nash
equilibrium. Specifically, we prove that the KL-divergence to the optimal
policy decreases at a rate of order (1+2β)^{-N/2}, where N is the number
of preference queries. We further demonstrate last-iterate linear convergence
for the exploitability gap and uniformly for the span semi-norm of
log-probabilities, with all these rates being independent of the size of the
action space. Furthermore, we propose and analyze an approximate version of
Nash-MP where proximal steps are estimated using stochastic policy gradients,
bringing the algorithm closer to real-world deployment. Finally, we detail a practical
implementation strategy for fine-tuning large language models and present
experiments that demonstrate its competitive performance and compatibility with
existing methods.
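To make the iteration scheme concrete, the following is a minimal numerical sketch of a Mirror-Prox-style (extragradient) update with entropic, i.e. KL, geometry on a toy β-regularized preference game. The preference matrix P, the uniform reference policy mu, the step size eta, and the use of single gradient steps in place of exact proximal steps are assumptions made purely for illustration; this is not the paper's Nash-MP specification or its LLM fine-tuning recipe.

```python
# Minimal sketch (not the paper's Nash-MP): a Mirror-Prox-style extragradient
# iteration with entropic (KL) geometry on a toy beta-regularized preference game.
# The preference matrix P, reference policy mu, step size eta, and single gradient
# steps in place of exact proximal steps are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
K = 5                                  # number of actions (e.g., candidate responses)

# Random preference matrix: P[a, b] = probability that action a is preferred to b.
A = rng.uniform(size=(K, K))
P = A / (A + A.T)                      # enforces P[a, b] + P[b, a] = 1 (intransitivity allowed)

mu = np.full(K, 1.0 / K)               # reference policy (uniform, for illustration)
beta = 0.1                             # KL-regularization strength
eta = 0.5                              # step size (assumed, not taken from the paper)

def grad(pi):
    """Gradient of the beta-regularized payoff pi -> pi^T P pi' - beta * KL(pi || mu),
    evaluated in self-play (the opponent also plays pi). Additive constants drop
    after normalization in the mirror step."""
    return P @ pi - beta * (np.log(pi) - np.log(mu))

def mirror_step(pi, g):
    """Entropic mirror step: multiplicative-weights update followed by normalization."""
    w = pi * np.exp(eta * g)
    return w / w.sum()

pi = mu.copy()
for _ in range(200):
    pi_half = mirror_step(pi, grad(pi))        # extrapolation ("look-ahead") step
    pi = mirror_step(pi, grad(pi_half))        # update using the look-ahead gradient

print("approximate regularized Nash policy:", np.round(pi, 3))
```

The two multiplicative-weights steps per iteration, a look-ahead step followed by an update evaluated at the look-ahead point, are what distinguish a Mirror Prox scheme from plain mirror descent.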