CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay
February 7, 2024
作者: Natasha Butt, Blazej Manczak, Auke Wiggers, Corrado Rainone, David Zhang, Michaël Defferrard, Taco Cohen
cs.AI
Abstract
Large language models are increasingly solving tasks that are commonly believed to require human-level reasoning ability. However, these models still perform very poorly on benchmarks of general intelligence such as the Abstraction and Reasoning Corpus (ARC). In this paper, we approach ARC as a programming-by-examples problem, and introduce a novel and scalable method for language model self-improvement called Code Iteration (CodeIt). Our method iterates between 1) program sampling and hindsight relabeling, and 2) learning from prioritized experience replay. By relabeling the goal of an episode (i.e., the target program output given input) to the realized output produced by the sampled program, our method effectively deals with the extreme sparsity of rewards in program synthesis. Applying CodeIt to the ARC dataset, we demonstrate that prioritized hindsight replay, along with pre-training and data-augmentation, leads to successful inter-task generalization. CodeIt is the first neuro-symbolic approach that scales to the full ARC evaluation dataset. Our method solves 15% of ARC evaluation tasks, achieving state-of-the-art performance and outperforming existing neural and symbolic baselines.
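The abstract's key idea is hindsight relabeling: every sampled program yields a usable training example once the episode's goal is replaced with the output the program actually produced. Below is a minimal sketch of that relabeling step, assuming hypothetical helper names (Grid, Episode, run_program) that are not taken from the CodeIt codebase; it illustrates the idea only, not the authors' implementation.

```python
# Minimal sketch of hindsight relabeling for programming-by-examples.
# All names here (Grid, Episode, run_program) are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable, List

Grid = List[List[int]]  # an ARC-style grid of integer color values


@dataclass
class Episode:
    inputs: List[Grid]   # task input grids
    targets: List[Grid]  # output grids used as the training goal
    program: str         # source code of the sampled program


def hindsight_relabel(
    inputs: List[Grid],
    program: str,
    run_program: Callable[[str, Grid], Grid],
) -> Episode:
    """Relabel the episode goal: instead of keeping the rarely-matched
    ground-truth outputs, store the outputs the sampled program actually
    produced, turning every executable sample into a valid training example."""
    realized = [run_program(program, grid) for grid in inputs]
    return Episode(inputs=inputs, targets=realized, program=program)
```

Per the abstract, such relabeled episodes would then feed a prioritized experience-replay buffer from which the language model is trained in the next iteration.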