ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent

Abstract

Answering complex natural language questions often necessitates multi-step reasoning and integrating external information. Several systems have combined knowledge retrieval with a large language model (LLM) to answer such questions. These systems, however, suffer from various failure cases, and we cannot directly train them end-to-end to fix such failures, as interaction with external knowledge is non-differentiable. To address these deficiencies, we define a ReAct-style LLM agent with the ability to reason and act upon external knowledge. We further refine the agent through a ReST-like method that iteratively trains on previous trajectories, employing growing-batch reinforcement learning with AI feedback for continuous self-improvement and self-distillation. Starting from a prompted large model and after just two iterations of the algorithm, we can produce a fine-tuned small model that achieves comparable performance on challenging compositional question-answering benchmarks with two orders of magnitude fewer parameters.
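
The iterative loop summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Trajectory` type, the `agent`, `score`, and `fine_tune` callables, the `threshold` filter, and the choice to accumulate data across iterations are all hypothetical assumptions made for illustration. The sketch only shows the general shape of a grow-then-improve loop, where the current agent produces trajectories, an AI-feedback scorer filters them, and the next agent is trained on what survives.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    """One agent run: a question, its reasoning/action steps, and a final answer."""
    question: str
    steps: List[str]
    answer: str


# Hypothetical interfaces (assumptions, not from the paper):
Agent = Callable[[str], Trajectory]            # runs the agent on one question
Scorer = Callable[[Trajectory], float]         # AI feedback: higher is better
Trainer = Callable[[List[Trajectory]], Agent]  # fine-tunes a model on trajectories


def self_improve(
    agent: Agent,
    questions: List[str],
    score: Scorer,
    fine_tune: Trainer,
    iterations: int = 2,
    threshold: float = 0.5,
) -> Tuple[Agent, List[Trajectory]]:
    """Grow a trajectory dataset with the current agent, then train the next one."""
    dataset: List[Trajectory] = []
    for _ in range(iterations):
        # Grow: collect fresh trajectories from the current agent.
        rollouts = [agent(q) for q in questions]
        # Filter: keep only trajectories the scorer rates at or above the threshold.
        dataset.extend(t for t in rollouts if score(t) >= threshold)
        # Improve: the next iteration uses an agent trained on the kept data.
        agent = fine_tune(dataset)
    return agent, dataset
```

In practice, `agent` would wrap a prompted or fine-tuned LLM with a search tool, and `fine_tune` could target a smaller model than the one that generated the data, which is how the self-distillation described above would enter the loop.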
