Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
January 20, 2025
Authors: Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, Jiecao Chen
cs.AI
Abstract
Large Language Model (LLM) agents are increasingly pivotal for addressing
complex tasks in interactive environments. Existing work mainly focuses on
enhancing performance through behavior cloning from stronger experts, yet such
approaches often falter in real-world applications, mainly due to the inability
to recover from errors. However, step-level critique data is difficult and
expensive to collect. Automating and dynamically constructing self-critique
datasets is thus crucial to empowering models with intelligent agent
capabilities. In this work, we propose an iterative self-training framework,
Agent-R, that enables language Agents to Reflect on the fly. Unlike traditional
methods that reward or penalize actions based on correctness, Agent-R leverages
Monte Carlo Tree Search (MCTS) to construct training data that recovers correct trajectories from
erroneous ones. A key challenge of agent reflection lies in the necessity for
timely revision rather than waiting until the end of a rollout. To address
this, we introduce a model-guided critique construction mechanism: the actor
model identifies the first error step (within its current capability) in a
failed trajectory. Starting from that step, we splice the erroneous prefix with an
adjacent correct path that shares the same parent node in the tree. This strategy enables the
model to learn reflection based on its current policy, thereby yielding
better learning efficiency. To further explore the scalability of this
self-improvement paradigm, we investigate iterative refinement of both error
correction capabilities and dataset construction. Our findings demonstrate that
Agent-R continuously improves the model's ability to recover from errors and
enables timely error correction. Experiments on three interactive environments
show that Agent-R effectively equips agents to correct erroneous actions while
avoiding loops, achieving superior performance compared to baseline methods
(+5.59%).
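To make the model-guided critique construction more concrete, the following is a minimal Python sketch of how a revision trajectory might be spliced together from an MCTS tree, under stated assumptions: the `Node` structure, the `is_error_step` callback (standing in for the actor model's judgment of the first mistake it can recognize), the `build_revision_trajectory` helper, and the reflection text are all hypothetical, since the abstract does not specify the paper's actual data format.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One (action, observation) step in an MCTS tree over agent rollouts (hypothetical)."""
    action: str
    observation: str
    value: float                                           # estimated return of this subtree
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)


def path_to(node: Node) -> List[Node]:
    """Return the root-to-node path."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return list(reversed(path))


def build_revision_trajectory(
    bad_leaf: Node,
    good_leaf: Node,
    is_error_step: Callable[[List[Node], Node], bool],
    reflection: str = "I realize my previous action was wrong; let me reconsider.",
) -> List[dict]:
    """Splice a failed rollout with a sibling correct path that shares its parent node.

    `is_error_step(prefix, step)` stands in for the actor model judging, within its
    current capability, whether `step` is the first mistake given the `prefix` so far.
    """
    bad_path, good_path = path_to(bad_leaf), path_to(good_leaf)

    # 1) The actor model locates the first error step in the failed trajectory.
    #    If it cannot identify one, fall back to the final step.
    t = next(
        (i for i, step in enumerate(bad_path) if is_error_step(bad_path[:i], step)),
        len(bad_path) - 1,
    )

    # 2) The correct path must diverge exactly at that step, i.e. both branches
    #    share the prefix bad_path[:t] and hence the same parent node.
    assert [n.action for n in good_path[:t]] == [n.action for n in bad_path[:t]], \
        "the correct path must share the failed prefix up to the error step"

    # 3) Revision trajectory = shared prefix + erroneous step + reflection signal
    #    + the correct continuation taken from the sibling branch.
    steps = [{"action": n.action, "observation": n.observation} for n in bad_path[: t + 1]]
    steps.append({"action": reflection, "observation": ""})
    steps += [{"action": n.action, "observation": n.observation} for n in good_path[t:]]
    return steps
```

The point of splicing at the first error the actor model can itself detect, rather than at an oracle-chosen step, is that the resulting revision data stays within the current policy's capability, which is how the abstract motivates the claimed gain in learning efficiency.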
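The iterative self-training paradigm can likewise be pictured as a short outer loop. The sketch below is purely illustrative: `run_mcts`, `build_revision_dataset` (which could be built on top of `build_revision_trajectory` above), and `finetune` are hypothetical callables passed in by the caller, not APIs from the paper.

```python
def agent_r_self_training(policy, tasks, run_mcts, build_revision_dataset, finetune,
                          num_iterations: int = 3):
    """Hypothetical outer loop: in each round the current policy explores tasks with MCTS,
    revision trajectories are spliced from its own failures, and the policy is fine-tuned
    on them, so later rounds both correct errors earlier and yield better training data."""
    for _ in range(num_iterations):
        dataset = []
        for task in tasks:
            tree = run_mcts(policy, task)                        # explore the environment
            dataset.extend(build_revision_dataset(policy, tree))  # splice bad -> good paths
        policy = finetune(policy, dataset)                       # train on self-critique data
    return policy
```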