ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent
December 15, 2023
Authors: Renat Aksitov, Sobhan Miryoosefi, Zonglin Li, Daliang Li, Sheila Babayan, Kavya Kopparapu, Zachary Fisher, Ruiqi Guo, Sushant Prakash, Pranesh Srinivasan, Manzil Zaheer, Felix Yu, Sanjiv Kumar
cs.AI
Abstract
Answering complex natural language questions often necessitates multi-step
reasoning and integrating external information. Several systems have combined
knowledge retrieval with a large language model (LLM) to answer such questions.
These systems, however, suffer from various failure cases, and we cannot
directly train them end-to-end to fix such failures, as interaction with
external knowledge is non-differentiable. To address these deficiencies, we
define a ReAct-style LLM agent with the ability to reason and act upon external
knowledge. We further refine the agent through a ReST-like method that
iteratively trains on previous trajectories, employing growing-batch
reinforcement learning with AI feedback for continuous self-improvement and
self-distillation. Starting from a prompted large model and after just two
iterations of the algorithm, we can produce a fine-tuned small model, with two
orders of magnitude fewer parameters, that achieves comparable performance on
challenging compositional question-answering benchmarks.
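
To make the abstract's method concrete, below is a minimal sketch (not the authors' code) of its two pieces: a ReAct-style rollout that interleaves reasoning with non-differentiable tool calls, and a ReST-like outer loop that grows a batch of rollouts, filters them with AI feedback, and fine-tunes on the survivors. The helpers `llm`, `search`, `rank`, and `fine_tune` are hypothetical placeholders, as is the `FINISH` convention for ending an episode.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    """One rollout: the question, interleaved thought/observation steps,
    and the final answer."""
    question: str
    steps: List[str] = field(default_factory=list)
    answer: str = ""


def react_episode(question: str,
                  llm: Callable[[str], str],
                  search: Callable[[str], str],
                  max_steps: int = 5) -> Trajectory:
    """ReAct-style loop: the model reasons, acts on external knowledge
    via `search`, observes the result, and repeats."""
    traj = Trajectory(question)
    for _ in range(max_steps):
        thought = llm("\n".join([question, *traj.steps, "Thought:"]))
        traj.steps.append(f"Thought: {thought}")
        if thought.startswith("FINISH"):  # hypothetical stop convention
            traj.answer = llm("\n".join([question, *traj.steps, "Answer:"]))
            break
        # The tool call is where end-to-end gradients break down: it is
        # non-differentiable, which motivates training on trajectories.
        observation = search(thought)
        traj.steps.append(f"Observation: {observation}")
    return traj


def rest_iteration(questions: List[str],
                   llm: Callable[[str], str],
                   search: Callable[[str], str],
                   rank: Callable[[Trajectory], float],
                   fine_tune: Callable[[List[Trajectory]], Callable[[str], str]],
                   threshold: float = 0.5) -> Callable[[str], str]:
    """One ReST-like iteration: collect a batch of rollouts, keep those
    ranked highly by AI feedback, and train on them. Passing a smaller
    trainee model to `fine_tune` makes this self-distillation."""
    batch = [react_episode(q, llm, search) for q in questions]
    good = [t for t in batch if rank(t) >= threshold]
    return fine_tune(good)  # returns the improved (or distilled) policy
```

In this framing, the paper's result corresponds to running `rest_iteration` twice, starting from a prompted large model as `llm` and fine-tuning a much smaller trainee on the filtered trajectories.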