Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning

October 2, 2024
Authors: Xiao Yu, Baolin Peng, Vineeth Vajipey, Hao Cheng, Michel Galley, Jianfeng Gao, Zhou Yu
cs.AI

Abstract

Autonomous agents have demonstrated significant potential in automating complex multistep decision-making tasks. However, even state-of-the-art vision-language models (VLMs), such as GPT-4o, still fall short of human-level performance, particularly in intricate web environments and long-horizon planning tasks. To address these limitations, we introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test-time algorithm designed to enhance the ability of AI agents, e.g., powered by GPT-4o, to explore the decision space on the fly. R-MCTS extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi-agent debate to provide reliable state evaluation. Moreover, we improve the agent's performance by fine-tuning GPT-4o through self-learning, using R-MCTS-generated tree traversals without any human-provided labels. On the challenging VisualWebArena benchmark, our GPT-4o-based R-MCTS agent achieves a 6% to 30% relative improvement across various tasks compared to the previous state of the art. Additionally, we show that the knowledge gained from test-time search can be effectively transferred back to GPT-4o via fine-tuning. The fine-tuned GPT-4o matches 97% of R-MCTS's performance while reducing compute usage by a factor of four at test time. Furthermore, qualitative results reveal that the fine-tuned GPT-4o model demonstrates the ability to explore the environment, evaluate states, and backtrack to viable ones when it detects that the current state cannot lead to success. Moreover, our work demonstrates compute scaling properties at both training time (data collection with R-MCTS) and test time. These results suggest a promising research direction for enhancing VLMs' reasoning and planning capabilities for agentic applications via test-time search and self-learning.
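The abstract describes R-MCTS as a standard MCTS loop with two new components: contrastive reflection, which carries lessons from past interactions into future searches, and multi-agent debate in place of rollouts for state evaluation. The sketch below is a minimal, hypothetical Python illustration of how those pieces could slot into a vanilla MCTS skeleton; the names (`Node`, `propose_actions`, `debate_value`), the reflection threshold, and the stubbed VLM calls are all assumptions made for illustration, not the paper's implementation.

```python
# Minimal sketch of an R-MCTS-style loop with hypothetical helper names.
# Real VLM calls (action proposal, debate-based evaluation) are stubbed
# with placeholders so the skeleton runs standalone.
import math
import random
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    state: str                      # e.g., a serialized web-page observation
    parent: Optional["Node"] = None
    action: Optional[str] = None    # action that led here from the parent
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

    def uct(self, c: float = 1.4) -> float:
        """Standard UCT score; unvisited children are tried first."""
        if self.visits == 0:
            return float("inf")
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def debate_value(state: str, n_judges: int = 3) -> float:
    # Stand-in for multi-agent debate: several VLM "judges" would argue
    # over the state's value and their estimates would be aggregated.
    # Here we just average random placeholders.
    return sum(random.random() for _ in range(n_judges)) / n_judges

def propose_actions(state: str, reflections: List[str]) -> List[str]:
    # Stand-in for the VLM policy, conditioned on reflections retrieved
    # from past interactions (the contrastive-reflection component).
    return [f"{state}/a{i}" for i in range(2)]

def r_mcts(root_state: str, n_iters: int = 50) -> Optional[str]:
    reflections: List[str] = []     # memory of lessons from earlier searches
    root = Node(root_state)
    for _ in range(n_iters):
        # 1) Selection: descend by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: ch.uct())
        # 2) Expansion: the (stubbed) VLM proposes candidate actions.
        for a in propose_actions(node.state, reflections):
            node.children.append(Node(state=a, parent=node, action=a))
        # 3) Evaluation: multi-agent debate replaces a random rollout.
        leaf = random.choice(node.children)
        value = debate_value(leaf.state)
        # 4) Backpropagation up to the root.
        n = leaf
        while n is not None:
            n.visits += 1
            n.value_sum += value
            n = n.parent
        # 5) Reflection: record a lesson when a branch looks unpromising
        #    (an arbitrary illustrative threshold).
        if value < 0.3:
            reflections.append(f"avoid {leaf.action}: low estimated value")
    best = max(root.children, key=lambda ch: ch.visits, default=None)
    return best.action if best else None

if __name__ == "__main__":
    print(r_mcts("start"))
```

In the paper's setup, tree traversals produced by searches like this also double as self-learning data: GPT-4o is fine-tuned on them without human labels, which is what allows the fine-tuned model to recover about 97% of R-MCTS's performance at a quarter of the test-time compute.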
