
MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

October 3, 2024
Authors: Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, Hua Wu
cs.AI

Abstract

Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to successful outcomes. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process. By operating at this higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial performance improvements over standard RLHF, with performance gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering tasks. Notably, our approach reaches parity with vanilla RLHF 1.7x to 2x faster in terms of training time and continues to outperform it with further training. We will make our code and data publicly available at https://github.com/ernie-research/MA-RLHF .
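
To make the abstract's core idea concrete, the following is a minimal sketch of macro-action policy-gradient bookkeeping, assuming a fixed-length token-chunking strategy. The function names (`chunk_into_macro_actions`, `macro_policy_gradient_loss`) and the chunk-mean advantage are illustrative assumptions, not the authors' implementation; the official code is in the repository linked above.

```python
# Illustrative sketch only: group per-token quantities into macro actions
# (fixed-length chunks here) so the policy-gradient update assigns credit
# over fewer, less delayed steps. Not the authors' implementation.
import torch


def chunk_into_macro_actions(token_log_probs: torch.Tensor,
                             token_advantages: torch.Tensor,
                             macro_size: int = 5):
    """Aggregate token-level log-probs and advantages into macro actions.

    Both inputs have shape (seq_len,); the last chunk may be shorter.
    """
    lp_chunks = torch.split(token_log_probs, macro_size)
    adv_chunks = torch.split(token_advantages, macro_size)
    # A macro action's log-prob is the sum over its tokens; its advantage
    # is shared across the chunk (here: the chunk mean, an assumption).
    macro_log_probs = torch.stack([c.sum() for c in lp_chunks])
    macro_advantages = torch.stack([c.mean() for c in adv_chunks])
    return macro_log_probs, macro_advantages


def macro_policy_gradient_loss(macro_log_probs: torch.Tensor,
                               macro_advantages: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style surrogate computed at the macro-action level."""
    return -(macro_log_probs * macro_advantages.detach()).mean()


if __name__ == "__main__":
    # Toy example: a 12-token response chunked into macro actions of 5 tokens.
    log_probs = torch.randn(12, requires_grad=True)
    advantages = torch.randn(12)
    mlp, madv = chunk_into_macro_actions(log_probs, advantages, macro_size=5)
    loss = macro_policy_gradient_loss(mlp, madv)
    loss.backward()
    print(loss.item())
```

In a full RLHF pipeline the advantages would come from a value model and a learned reward model over the generated sequence; the sketch only illustrates the abstract's claim that grouping tokens into macro actions shortens the credit-assignment horizon without adding training or inference cost.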
