Efficient Exploration at Scale

March 18, 2026
Authors: Seyed Mohammad Asghari, Chris Chute, Vikranth Dwaracherla, Xiuyuan Lu, Mehdi Jafarnia, Victor Minden, Zheng Wen, Benjamin Van Roy
cs.AI

Abstract

We develop an online learning algorithm that dramatically improves the data efficiency of reinforcement learning from human feedback (RLHF). Our algorithm incrementally updates reward and language models as choice data is received. The reward model is fit to the choice data, while the language model is updated by a variation of REINFORCE, with reinforcement signals provided by the reward model. Several features enable the efficiency gains: a small affirmative nudge added to each reinforcement signal, an epistemic neural network that models reward uncertainty, and information-directed exploration. With Gemma large language models (LLMs), our algorithm matches the performance of offline RLHF trained on 200K labels using fewer than 20K labels, representing more than a 10x gain in data efficiency. Extrapolating from our results, we expect our algorithm trained on 1M labels to match offline RLHF trained on 1B labels. This represents a 1,000x gain. To our knowledge, these are the first results to demonstrate that such large improvements are possible.
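As a rough illustration of the loop the abstract describes, the sketch below runs online reward-model fitting and a REINFORCE-style policy update with a small affirmative nudge on a two-response bandit instead of an LLM. All names, constants, and the running-average reward fit are illustrative assumptions, not the paper's actual design; the epistemic neural network and information-directed exploration are omitted for brevity.

```python
import math
import random

random.seed(0)

# Toy sketch, not the paper's method: constants and the running-average
# reward fit are illustrative assumptions.
NUDGE = 0.05   # small affirmative nudge added to each reinforcement signal
LR = 0.2       # policy learning rate

true_pref = {"a": 0.2, "b": 0.8}    # stand-in for latent human preference
reward_est = {"a": 0.0, "b": 0.0}   # reward model, fit incrementally to labels
count = {"a": 0, "b": 0}
theta = 0.0                         # policy logit for producing response "b"

def p_b():
    """Probability the policy produces response 'b'."""
    return 1.0 / (1.0 + math.exp(-theta))

for _ in range(3000):
    # Sample a response from the current policy.
    action = "b" if random.random() < p_b() else "a"
    # Simulated binary choice feedback; the reward model is fit online
    # as a running average over the labels received so far.
    label = 1.0 if random.random() < true_pref[action] else 0.0
    count[action] += 1
    reward_est[action] += (label - reward_est[action]) / count[action]
    # REINFORCE-style update, with the affirmative nudge on the signal.
    signal = reward_est[action] + NUDGE
    grad_logp = (1.0 if action == "b" else 0.0) - p_b()
    theta += LR * signal * grad_logp
```

Because the nudged signal stays positive, every sampled response is reinforced in proportion to its estimated reward, and the policy drifts toward the response the reward model currently prefers.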