Accelerating Nash Learning from Human Feedback via Mirror Prox
May 26, 2025
Authors: Daniil Tiapkin, Daniele Calandriello, Denis Belomestny, Eric Moulines, Alexey Naumov, Kashif Rasul, Michal Valko, Pierre Menard
cs.AI
Abstract
Traditional Reinforcement Learning from Human Feedback (RLHF) often relies on
reward models, frequently assuming preference structures like the Bradley-Terry
model, which may not accurately capture the complexities of real human
preferences (e.g., intransitivity). Nash Learning from Human Feedback (NLHF)
offers a more direct alternative by framing the problem as finding a Nash
equilibrium of a game defined by these preferences. In this work, we introduce
Nash Mirror Prox (Nash-MP), an online NLHF algorithm that leverages
the Mirror Prox optimization scheme to achieve fast and stable convergence to
the Nash equilibrium. Our theoretical analysis establishes that Nash-MP
exhibits last-iterate linear convergence towards the beta-regularized Nash
equilibrium. Specifically, we prove that the KL-divergence to the optimal
policy decreases at a rate of order (1+2beta)^{-N/2}, where N is the number
of preference queries. We further demonstrate last-iterate linear convergence
for the exploitability gap and uniformly for the span semi-norm of
log-probabilities, with all these rates being independent of the size of the
action space. Furthermore, we propose and analyze an approximate version of
Nash-MP where proximal steps are estimated using stochastic policy gradients,
bringing the algorithm closer to real-world applications. Finally, we detail a practical
implementation strategy for fine-tuning large language models and present
experiments that demonstrate its competitive performance and compatibility with
existing methods.
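To make the Mirror Prox structure concrete, below is a minimal tabular sketch of an extragradient (Mirror Prox) update with a KL mirror map for a beta-regularized preference game over an intransitive "rock-paper-scissors" preference matrix. The preference matrix P, the step size eta, the uniform reference policy, and the symmetric self-play form are illustrative assumptions for this toy example; the paper's Nash-MP targets language-model policies and estimates the proximal steps with stochastic policy gradients from preference queries rather than exact tabular updates.

```python
# Toy sketch (assumed setup): Mirror Prox / extragradient iteration for a
# beta-regularized preference game with an intransitive preference matrix.
# This is NOT the paper's exact Nash-MP implementation, only an illustration.
import numpy as np

# Intransitive preferences: P[i, j] = probability that action i is preferred to action j.
P = np.array([
    [0.5, 1.0, 0.0],
    [0.0, 0.5, 1.0],
    [1.0, 0.0, 0.5],
])

beta = 0.1                       # KL-regularization strength towards pi_ref (assumed)
eta = 0.5                        # step size (assumed)
pi_ref = np.ones(3) / 3          # uniform reference policy (assumed)
pi = np.array([0.7, 0.2, 0.1])   # deliberately non-uniform starting policy

def kl_mirror_step(base, eval_pt, opponent):
    """Multiplicative-weights (KL mirror) step taken from `base`, with the
    regularized payoff gradient evaluated at (`eval_pt`, `opponent`)."""
    grad = P @ opponent - beta * np.log(eval_pt / pi_ref)
    new = base * np.exp(eta * grad)
    return new / new.sum()

for _ in range(500):
    # Mirror Prox = extragradient: extrapolate, then update from the same base point.
    pi_half = kl_mirror_step(pi, pi, pi)
    pi = kl_mirror_step(pi, pi_half, pi_half)

print(np.round(pi, 3))  # approaches the uniform policy, the regularized Nash equilibrium here
```

In this sketch the exact expectation P @ opponent plays the role that sampled preference comparisons and policy-gradient estimates play in the approximate version of the algorithm described in the abstract.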