Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning

Abstract

Autonomous agents have demonstrated significant potential in automating complex multi-step decision-making tasks. However, even state-of-the-art vision-language models (VLMs), such as GPT-4o, still fall short of human-level performance, particularly in intricate web environments and long-horizon planning tasks. To address these limitations, we introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test-time algorithm designed to enhance the ability of AI agents (e.g., those powered by GPT-4o) to explore the decision space on the fly. R-MCTS extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi-agent debate to provide reliable state evaluation. Moreover, we improve the agent's performance by fine-tuning GPT-4o through self-learning, using R-MCTS-generated tree traversals without any human-provided labels. On the challenging VisualWebArena benchmark, our GPT-4o-based R-MCTS agent achieves a 6% to 30% relative improvement across various tasks compared to the previous state of the art. Additionally, we show that the knowledge gained from test-time search can be effectively transferred back to GPT-4o via fine-tuning. The fine-tuned GPT-4o matches 97% of R-MCTS's performance while reducing compute usage by a factor of four at test time. Furthermore, qualitative results reveal that the fine-tuned GPT-4o model demonstrates the ability to explore the environment, evaluate a state, and backtrack to a viable state when it detects that the current state cannot lead to success. Our work also demonstrates compute scaling properties at both training time (data collection with R-MCTS) and test time. These results suggest a promising research direction for enhancing VLMs' reasoning and planning capabilities for agentic applications via test-time search and self-learning.
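To make the search loop in the abstract more concrete, the following is a minimal sketch of a generic UCT-style Monte Carlo tree search in Python. It is not the R-MCTS algorithm itself, whose details are not given in the abstract. Everything in it is an assumption for illustration: the toy task (reach the integer TARGET by adding 1, 2, or 3), the names Node, debate_value, and search, and the two hand-written evaluators. Averaging several evaluators is only a simple stand-in for multi-agent debate, and contrastive reflection is omitted entirely.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

TARGET = 7           # hypothetical toy goal: reach this integer exactly
ACTIONS = (1, 2, 3)  # hypothetical actions: add 1, 2, or 3 to the state


@dataclass
class Node:
    state: int
    parent: Optional["Node"] = None
    action: Optional[int] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0  # running mean of backed-up values


def debate_value(state: int, evaluators: Sequence[Callable[[int], float]]) -> float:
    """Average several independent estimates (a stand-in for multi-agent debate)."""
    return sum(e(state) for e in evaluators) / len(evaluators)


def uct_select(node: Node, c: float) -> Node:
    """Pick the child maximizing mean value plus an exploration bonus."""
    return max(
        node.children,
        key=lambda ch: ch.value + c * math.sqrt(math.log(node.visits) / ch.visits),
    )


def expand(node: Node) -> None:
    """Create one child per action that does not overshoot the target."""
    for a in ACTIONS:
        if node.state + a <= TARGET:
            node.children.append(Node(node.state + a, parent=node, action=a))


def search(root_state: int, evaluators, iterations: int = 100, c: float = 1.0) -> int:
    """Return the action at the root with the most visits (ties broken by mean value).

    Assumes root_state < TARGET so that the root has at least one child.
    """
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # Selection: descend while every child has been visited at least once.
        while node.children and all(ch.visits > 0 for ch in node.children):
            node = uct_select(node, c)
        # Expansion: grow a non-terminal leaf, then move to an unvisited child.
        if not node.children and node.state < TARGET:
            expand(node)
        unvisited = [ch for ch in node.children if ch.visits == 0]
        if unvisited:
            node = unvisited[0]
        # Evaluation: score the reached state instead of running a rollout.
        value = debate_value(node.state, evaluators)
        # Backpropagation: update running means along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += (value - node.value) / node.visits
            node = node.parent
    return max(root.children, key=lambda ch: (ch.visits, ch.value)).action


# Two hypothetical evaluators: exact success and fractional progress.
evaluators = [lambda s: 1.0 if s == TARGET else 0.0, lambda s: s / TARGET]

if __name__ == "__main__":
    print(search(4, evaluators))
```

In this sketch the state evaluator replaces random rollouts, which mirrors the general idea of using a model-based value estimate during search. A real web agent would need an environment interface, an action proposer, and model-backed evaluators, none of which are specified here.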
