ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent
December 15, 2023
Authors: Renat Aksitov, Sobhan Miryoosefi, Zonglin Li, Daliang Li, Sheila Babayan, Kavya Kopparapu, Zachary Fisher, Ruiqi Guo, Sushant Prakash, Pranesh Srinivasan, Manzil Zaheer, Felix Yu, Sanjiv Kumar
cs.AI
Abstract
Answering complex natural language questions often necessitates multi-step
reasoning and integrating external information. Several systems have combined
knowledge retrieval with a large language model (LLM) to answer such questions.
These systems, however, suffer from various failure cases, and we cannot
directly train them end-to-end to fix such failures, as interaction with
external knowledge is non-differentiable. To address these deficiencies, we
define a ReAct-style LLM agent with the ability to reason and act upon external
knowledge. We further refine the agent through a ReST-like method that
iteratively trains on previous trajectories, employing growing-batch
reinforcement learning with AI feedback for continuous self-improvement and
self-distillation. Starting from a prompted large model and after just two
iterations of the algorithm, we can produce a fine-tuned small model that
achieves comparable performance on challenging compositional question-answering
benchmarks while having two orders of magnitude fewer parameters.
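The abstract describes two nested loops: an inner ReAct-style loop that interleaves reasoning steps with calls to an external knowledge tool, and an outer ReST-like loop that samples trajectories with the current policy, ranks them with AI feedback, and fine-tunes on the best ones. The sketch below is a minimal, hypothetical illustration of that structure, not the paper's implementation: `llm`, `search`, `reward_model`, and `fine_tune` are stand-in stubs, and the trajectory format is assumed.

```python
# Hypothetical sketch of the two loops described in the abstract.
# `llm`, `search`, `reward_model`, and `fine_tune` are stand-in stubs,
# not the paper's code; the "Thought/Action/Observation" format is assumed.

def llm(prompt: str) -> str:
    """Stub: one call to the current policy model."""
    return "Final Answer: <answer>"

def search(query: str) -> str:
    """Stub: the external, non-differentiable retrieval tool."""
    return "<retrieved snippet>"

def reward_model(trajectory: list[str]) -> float:
    """Stub: AI feedback, e.g. an LLM judge scoring the full trajectory."""
    return 0.0

def fine_tune(trajectories: list[list[str]]) -> None:
    """Stub: supervised fine-tuning on selected trajectories. The target can
    be the same model (self-improvement) or a smaller one (self-distillation)."""

def react_episode(question: str, max_steps: int = 5) -> list[str]:
    """Inner ReAct-style loop: interleave reasoning with tool actions.
    Tool calls are opaque to gradients, which is why the outer loop trains
    on whole trajectories instead of backpropagating end-to-end."""
    trajectory = [f"Question: {question}"]
    for _ in range(max_steps):
        step = llm("\n".join(trajectory))
        trajectory.append(step)
        if step.startswith("Final Answer:"):
            break
        if "Action: search[" in step:
            query = step.split("Action: search[", 1)[1].rstrip("]")
            trajectory.append(f"Observation: {search(query)}")
    return trajectory

def rest_iteration(questions: list[str], keep_fraction: float = 0.5) -> None:
    """Outer ReST-like iteration (growing-batch RL): sample trajectories with
    the current policy, rank them by AI feedback, fine-tune on the best."""
    trajectories = [react_episode(q) for q in questions]
    ranked = sorted(trajectories, key=reward_model, reverse=True)
    fine_tune(ranked[: int(len(ranked) * keep_fraction)])
```

Under these assumptions, the generate-rank-fine-tune cycle sidesteps the non-differentiable tool interaction: the retrieval results are simply part of the recorded trajectories, and only the selected trajectories feed the supervised fine-tuning step. The abstract reports that two such iterations suffice to distill the prompted large model into a much smaller one.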