MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions
October 3, 2024
Authors: Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, Hua Wu
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) has demonstrated
effectiveness in aligning large language models (LLMs) with human preferences.
However, token-level RLHF suffers from the credit assignment problem over long
sequences, where delayed rewards make it challenging for the model to discern
which actions contributed to successful outcomes. This hinders learning
efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple
yet effective RLHF framework that incorporates macro actions -- sequences of
tokens or higher-level language constructs -- into the learning process. By
operating at this higher level of abstraction, our approach reduces the
temporal distance between actions and rewards, facilitating faster and more
accurate credit assignment. This results in more stable policy gradient
estimates and enhances learning efficiency within each episode, all without
increasing computational complexity during training or inference. We validate
our approach through extensive experiments across various model sizes and
tasks, including text summarization, dialogue generation, question answering,
and program synthesis. Our method achieves substantial performance improvements
over standard RLHF, with performance gains of up to 30% in text summarization
and code generation, 18% in dialogue, and 8% in question answering tasks.
Notably, our approach reaches parity with vanilla RLHF 1.7x to 2x faster in
terms of training time and continues to outperform it with further training. We
will make our code and data publicly available at
https://github.com/ernie-research/MA-RLHF.
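To make the core idea concrete, below is a minimal, illustrative sketch (not the authors' released implementation) of how token-level rewards and critic values might be aggregated into fixed-length macro actions before advantage estimation. The macro size, the per-macro reward summation, and the use of the first-token critic value as the macro-level value are illustrative assumptions.

```python
# Illustrative sketch of macro-action credit assignment, assuming fixed-length
# macro actions. Not the authors' implementation; details such as the reward
# aggregation rule and macro-level value definition are assumptions.
import torch


def fixed_ngram_macro_boundaries(seq_len: int, macro_size: int = 5):
    """Chunk a token sequence into fixed-length macro actions, returning
    (start, end) index pairs (the simplest possible termination rule)."""
    return [(i, min(i + macro_size, seq_len)) for i in range(0, seq_len, macro_size)]


def macro_action_advantages(token_rewards: torch.Tensor,
                            token_values: torch.Tensor,
                            macro_size: int = 5,
                            gamma: float = 1.0,
                            lam: float = 0.95) -> torch.Tensor:
    """Compute GAE over macro actions and broadcast the result back to tokens.

    Aggregating rewards and values per macro action shortens the decision
    sequence, so each advantage is estimated over far fewer steps -- the
    intuition behind faster credit assignment at a higher level of abstraction.
    """
    bounds = fixed_ngram_macro_boundaries(token_rewards.numel(), macro_size)

    # Macro-level reward: sum of token rewards inside the macro action (assumption).
    macro_rewards = torch.stack([token_rewards[s:e].sum() for s, e in bounds])
    # Macro-level value: critic value at the macro action's first token (assumption).
    macro_values = torch.stack([token_values[s] for s, _ in bounds])

    # Standard GAE, but over the (much shorter) macro-action sequence.
    num_macros = macro_rewards.numel()
    macro_adv = torch.zeros(num_macros)
    running = 0.0
    for t in reversed(range(num_macros)):
        next_value = macro_values[t + 1] if t + 1 < num_macros else torch.tensor(0.0)
        delta = macro_rewards[t] + gamma * next_value - macro_values[t]
        running = delta + gamma * lam * running
        macro_adv[t] = running

    # Broadcast each macro action's advantage to its constituent tokens so the
    # usual token-level PPO ratio/clipping machinery can be reused unchanged.
    token_adv = torch.zeros_like(token_rewards)
    for (s, e), adv in zip(bounds, macro_adv):
        token_adv[s:e] = adv
    return token_adv
```

For example, with a 20-token response and `macro_size=5`, credit is assigned across only 4 macro actions rather than 20 individual tokens, which shrinks the temporal distance between an action and the (delayed) sequence-level reward.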