Reward-Robust RLHF in LLMs

Abstract

As Large Language Models (LLMs) progress toward more advanced forms of intelligence, Reinforcement Learning from Human Feedback (RLHF) is increasingly seen as a key pathway toward Artificial General Intelligence (AGI). However, reliance on reward-model-based (RM-based) alignment methods introduces significant challenges: Reward Models (RMs) are inherently unstable and imperfect, which can lead to critical issues such as reward hacking and misalignment with human intentions. In this paper, we introduce a reward-robust RLHF framework that addresses these fundamental challenges, paving the way for more reliable and resilient learning in LLMs. Our approach introduces a novel optimization objective that balances performance and robustness by incorporating Bayesian Reward Model Ensembles (BRME) to model the uncertainty set of reward functions. This allows the framework to integrate both nominal performance and minimum reward signals, yielding more stable learning even with imperfect reward models. Empirical results show that our framework consistently outperforms traditional RLHF across diverse benchmarks, with improved accuracy and long-term stability. We also provide a theoretical analysis showing that reward-robust RLHF approaches the stability of constant-reward settings, which proves effective in a stochastic-case analysis. Together, these contributions highlight the framework's potential to enhance both the performance and stability of LLM alignment with RLHF.
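The idea of combining nominal performance with a minimum reward signal can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `robust_reward`, the trade-off weight `lam`, and the choice of the ensemble mean as the nominal reward are all assumptions made here for illustration. The abstract states only that the objective integrates a nominal term and a minimum over the uncertainty set modeled by the ensemble.

```python
from statistics import mean
from typing import Sequence


def robust_reward(ensemble_rewards: Sequence[float], lam: float = 0.5) -> float:
    """Blend a nominal reward with the worst-case reward of an ensemble.

    ensemble_rewards: one scalar reward per ensemble member for the same
        response (hypothetical input format).
    lam: hypothetical trade-off weight in [0, 1]; 1.0 uses only the nominal
        term, 0.0 uses only the worst-case term.
    """
    if not ensemble_rewards:
        raise ValueError("ensemble_rewards must not be empty")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    # Assumption: the nominal signal is the ensemble mean.
    nominal = mean(ensemble_rewards)
    # Robustness term: the minimum reward across ensemble members.
    worst_case = min(ensemble_rewards)

    return lam * nominal + (1.0 - lam) * worst_case


if __name__ == "__main__":
    rewards = [1.0, 2.0, 3.0]
    for lam in (0.0, 0.5, 1.0):
        print(lam, robust_reward(rewards, lam))
```

Under these assumptions, a smaller `lam` makes the signal more conservative: a response is rewarded only to the extent that every ensemble member scores it well, which is one way to discourage exploiting the errors of a single reward model.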
