MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

Abstract

Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it hard for the model to discern which actions contributed to a successful outcome. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions (sequences of tokens or higher-level language constructs) into the learning process. By operating at this higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial improvements over standard RLHF, with gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering. Notably, our approach reaches parity with vanilla RLHF 1.7x to 2x faster in terms of training time and continues to outperform it with further training. We will make our code and data publicly available at https://github.com/ernie-research/MA-RLHF.
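To make the "temporal distance" argument concrete, here is a minimal sketch of how grouping tokens into macro actions shortens the path between an early decision and a reward that arrives only at the end of the sequence. The abstract does not say how tokens are grouped or how returns are computed, so the fixed-length chunks of `n` tokens, the discount factor `gamma = 0.9`, and the function names `group_into_macro_actions` and `discounted_returns` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List


def group_into_macro_actions(tokens: List[str], n: int) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most n tokens.

    Assumption: fixed-length chunking is only one possible grouping rule.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tokens[i:i + n] for i in range(0, len(tokens), n)]


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Compute the return-to-go G_t = r_t + gamma * G_{t+1}, right to left."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A 12-token response whose only reward (1.0) arrives at the final step.
tokens = "the cat sat on the mat and then fell fast asleep .".split()
gamma = 0.9  # illustrative value, not taken from the paper

# Token level: one decision per token, so the first decision is
# 11 steps away from the reward.
token_rewards = [0.0] * (len(tokens) - 1) + [1.0]
token_returns = discounted_returns(token_rewards, gamma)

# Macro level: one decision per chunk of 4 tokens, so the first decision
# is only 2 steps away from the same reward.
macro_actions = group_into_macro_actions(tokens, n=4)
macro_rewards = [0.0] * (len(macro_actions) - 1) + [1.0]
macro_returns = discounted_returns(macro_rewards, gamma)

print("token-level decisions:", len(token_rewards))
print("macro-level decisions:", len(macro_rewards))
print("signal reaching the first token-level decision:", token_returns[0])
print("signal reaching the first macro-level decision:", macro_returns[0])
```

In this toy setup the learning signal reaching the first decision is `gamma ** 11` at the token level but `gamma ** 2` at the macro level, which is one way to read the claim that fewer decision points between an action and its reward make credit assignment easier.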
