Accelerating Nash Learning from Human Feedback via Mirror Prox

Abstract

Traditional Reinforcement Learning from Human Feedback (RLHF) often relies on reward models and frequently assumes a preference structure such as the Bradley-Terry model, which may not accurately capture the complexities of real human preferences (e.g., intransitivity). Nash Learning from Human Feedback (NLHF) offers a more direct alternative by framing the problem as finding a Nash equilibrium of a game defined by these preferences. In this work, we introduce Nash Mirror Prox (Nash-MP), an online NLHF algorithm that leverages the Mirror Prox optimization scheme to achieve fast and stable convergence to the Nash equilibrium. Our theoretical analysis establishes that Nash-MP exhibits last-iterate linear convergence towards the $\beta$-regularized Nash equilibrium. Specifically, we prove that the KL-divergence to the optimal policy decreases at a rate of order $(1+2\beta)^{-N/2}$, where $N$ is the number of preference queries. We further demonstrate last-iterate linear convergence for the exploitability gap and uniformly for the span semi-norm of log-probabilities, with all these rates being independent of the size of the action space. Furthermore, we propose and analyze an approximate version of Nash-MP in which the proximal steps are estimated using stochastic policy gradients, bringing the algorithm closer to practical use. Finally, we detail a practical implementation strategy for fine-tuning large language models and present experiments that demonstrate its competitive performance and compatibility with existing methods.
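To get a feel for the stated rate, the short sketch below evaluates the factor $(1+2\beta)^{-N/2}$ and inverts it to find how many preference queries are needed before the factor drops below a target. It is only an illustration of the order of the bound: the multiplicative constant and the initial KL-divergence are ignored, and the function names `kl_rate_factor` and `queries_for_factor` are hypothetical helpers, not part of Nash-MP.

```python
import math


def kl_rate_factor(beta: float, n_queries: int) -> float:
    """Order of the KL bound after n_queries queries: (1 + 2*beta)^(-N/2).

    Assumption: constants and the initial KL-divergence are omitted.
    """
    if beta <= 0:
        raise ValueError("beta must be positive for the factor to shrink")
    return (1.0 + 2.0 * beta) ** (-n_queries / 2.0)


def queries_for_factor(beta: float, target: float) -> int:
    """Smallest N such that (1 + 2*beta)^(-N/2) <= target, with 0 < target < 1."""
    if beta <= 0 or not 0.0 < target < 1.0:
        raise ValueError("need beta > 0 and 0 < target < 1")
    # Solve (1 + 2*beta)^(-N/2) <= target for N and round up.
    return math.ceil(2.0 * math.log(1.0 / target) / math.log1p(2.0 * beta))


if __name__ == "__main__":
    beta = 0.5  # with beta = 0.5 the base is 2, so two queries halve the factor
    n = queries_for_factor(beta, 0.01)
    print(n, kl_rate_factor(beta, n))
```

With `beta = 0.5`, reaching a factor of 0.01 takes 14 queries under these simplifications. Smaller values of `beta` bring the base closer to 1, so the same target needs many more queries.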
