ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent

Abstract

Answering complex natural language questions often necessitates multi-step reasoning and integrating external information. Several systems have combined knowledge retrieval with a large language model (LLM) to answer such questions. These systems, however, suffer from various failure cases, and we cannot directly train them end-to-end to fix such failures, as interaction with external knowledge is non-differentiable. To address these deficiencies, we define a ReAct-style LLM agent with the ability to reason and act upon external knowledge. We further refine the agent through a ReST-like method that iteratively trains on previous trajectories, employing growing-batch reinforcement learning with AI feedback for continuous self-improvement and self-distillation. Starting from a prompted large model and after just two iterations of the algorithm, we can produce a fine-tuned small model that achieves comparable performance on challenging compositional question-answering benchmarks with two orders of magnitude fewer parameters.
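
The iterative scheme described above (collect trajectories, filter them with AI feedback, fine-tune, repeat) can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in introduced for illustration: `Trajectory`, `self_improve`, the `toy_*` helpers, the `min_score` threshold, and the choice to accumulate kept trajectories across iterations are assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    question: str
    steps: List[str]  # interleaved thoughts, actions, and observations
    answer: str


def self_improve(
    policy: int,
    questions: List[str],
    run_agent: Callable[[int, str], Trajectory],
    judge: Callable[[Trajectory], float],
    fine_tune: Callable[[int, List[Trajectory]], int],
    iterations: int = 2,
    min_score: float = 0.5,
) -> Tuple[int, List[Trajectory]]:
    """Hypothetical outer loop: roll out, filter by a judge score, fine-tune."""
    dataset: List[Trajectory] = []
    for _ in range(iterations):
        # Roll out the current agent on every question.
        rollouts = [run_agent(policy, q) for q in questions]
        # Keep only trajectories the (AI) judge rates highly enough.
        kept = [t for t in rollouts if judge(t) >= min_score]
        # Assumption: kept data accumulates across iterations.
        dataset.extend(kept)
        # Train the next policy on the filtered trajectories.
        policy = fine_tune(policy, dataset)
    return policy, dataset


# Toy stand-ins so the sketch runs without any model or retrieval backend.
def toy_run_agent(policy: int, question: str) -> Trajectory:
    steps = [f"Thought: look up '{question}'", "Action: search", "Observation: (stub)"]
    return Trajectory(question, steps, answer=f"stub answer from policy v{policy}")


def toy_judge(trajectory: Trajectory) -> float:
    # Placeholder for AI feedback: accept only well-formed questions.
    return 1.0 if trajectory.question.endswith("?") else 0.0


def toy_fine_tune(policy: int, data: List[Trajectory]) -> int:
    # Placeholder for fine-tuning: just bump a version counter.
    return policy + 1


questions = ["Who wrote X?", "malformed input"]
final_policy, data = self_improve(0, questions, toy_run_agent, toy_judge, toy_fine_tune)
```

In a real setup the policy would be a model rather than a version counter, and `fine_tune` could return a smaller model than the one that produced the rollouts, which is one plausible reading of where self-distillation enters the loop.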
