Scaling Autonomous Agents via Automatic Reward Modeling And Planning
February 17, 2025
Authors: Zhenfang Chen, Delin Chen, Rui Sun, Wenjun Liu, Chuang Gan
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across
a range of text-generation tasks. However, LLMs still struggle with problems
requiring multi-step decision-making and environmental feedback, such as online
shopping, scientific reasoning, and mathematical problem-solving. Unlike pure
text data, large-scale decision-making data is difficult to collect.
Moreover, many powerful LLMs are only accessible through APIs, which hinders
their fine-tuning for agent tasks due to cost and complexity. To address LLM
agents' limitations, we propose a framework that can automatically learn a
reward model from the environment without human annotations. This model can be
used to evaluate the action trajectories of LLM agents and provide heuristics
for task planning. Specifically, our approach involves employing one LLM-based
agent to navigate an environment randomly, generating diverse action
trajectories. Subsequently, a separate LLM is leveraged to assign a task intent
and synthesize a negative response alongside the correct response for each
trajectory. These triplets (task intent, positive response, and negative
response) are then utilized as training data to optimize a reward model capable
of scoring action trajectories. The effectiveness and generalizability of our
framework are demonstrated through evaluations conducted on different agent
benchmarks. In conclusion, our proposed framework represents a significant
advancement in enhancing LLM agents' decision-making capabilities. By
automating the learning of reward models, we overcome the challenges of data
scarcity and API limitations, potentially revolutionizing the application of
LLMs in complex and interactive environments. This research paves the way for
more sophisticated AI agents capable of tackling a wide range of real-world
problems requiring multi-step decision-making.
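The abstract describes training a reward model on (task intent, positive response, negative response) triplets, but does not state the training objective. A common choice for such pairwise preference data is a Bradley-Terry style ranking loss, sketched below as a minimal, self-contained example; the function name and the scalar `score_pos`/`score_neg` inputs (standing in for the reward model's scores of the two trajectories) are illustrative assumptions, not the paper's implementation.

```python
import math

def pairwise_reward_loss(score_pos: float, score_neg: float) -> float:
    """Bradley-Terry style pairwise loss: -log sigmoid(r_pos - r_neg).

    Minimizing this pushes the reward model to score the correct
    trajectory (score_pos) above the synthesized negative trajectory
    (score_neg) for the same task intent.
    """
    margin = score_pos - score_neg
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the positive trajectory is scored further
# above the negative one, and equals log(2) when the two are tied.
tied = pairwise_reward_loss(0.0, 0.0)        # log(2) ≈ 0.693
separated = pairwise_reward_loss(2.0, 0.0)   # smaller: model prefers positive
```

In practice the two scores would come from a shared scoring network applied to each (intent, trajectory) pair, and the loss would be averaged over a batch of synthesized triplets.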