

Efficient Exploration at Scale

March 18, 2026
Authors: Seyed Mohammad Asghari, Chris Chute, Vikranth Dwaracherla, Xiuyuan Lu, Mehdi Jafarnia, Victor Minden, Zheng Wen, Benjamin Van Roy
cs.AI

Abstract

We develop an online learning algorithm that dramatically improves the data efficiency of reinforcement learning from human feedback (RLHF). Our algorithm incrementally updates reward and language models as choice data is received. The reward model is fit to the choice data, while the language model is updated by a variant of REINFORCE, with reinforcement signals provided by the reward model. Several features enable the efficiency gains: a small affirmative nudge added to each reinforcement signal, an epistemic neural network that models reward uncertainty, and information-directed exploration. With Gemma large language models (LLMs), our algorithm matches the performance of offline RLHF trained on 200K labels while using fewer than 20K labels, a more than 10x gain in data efficiency. Extrapolating from our results, we expect our algorithm trained on 1M labels to match offline RLHF trained on 1B labels, a 1,000x gain. To our knowledge, these are the first results to demonstrate that such large improvements are possible.
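The core update the abstract describes, a REINFORCE-style policy update whose reinforcement signal is a reward-model score plus a small affirmative nudge, can be sketched on a toy policy over three candidate responses. Everything below is an illustrative assumption, not the paper's implementation: the reward scores, nudge size, and learning rate are made up, and the exact expected-gradient form replaces sampling so the demo is deterministic.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy policy: logits over three candidate responses (hypothetical).
logits = [0.0, 0.0, 0.0]
# Hypothetical reward-model scores for each response, for illustration only.
rewards = [0.2, 1.0, -0.5]
NUDGE = 0.1  # small affirmative nudge added to each reinforcement signal
LR = 0.5     # illustrative learning rate

for _ in range(200):
    probs = softmax(logits)
    signals = [r + NUDGE for r in rewards]
    # Exact policy gradient of the expected signal under a softmax policy:
    # d E[s] / d logit_i = p_i * (s_i - E[s]).
    expected = sum(p * s for p, s in zip(probs, signals))
    for i in range(len(logits)):
        logits[i] += LR * probs[i] * (signals[i] - expected)

probs = softmax(logits)
```

After training, the policy concentrates on the response the reward model scores highest; in the paper's online setting this update would interleave with refitting the reward model on incoming choice data.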