WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents
April 8, 2024
作者: Michael Lutz, Arth Bohra, Manvel Saroyan, Artem Harutyunyan, Giovanni Campagna
cs.AI
Abstract
In the realm of web agent research, achieving both generalization and
accuracy remains a challenging problem. Due to high variance in website
structure, existing approaches often fail. Moreover, existing fine-tuning and
in-context learning techniques fail to generalize across multiple websites. We
introduce Wilbur, an approach that uses a differentiable ranking model and a
novel instruction synthesis technique to optimally populate a black-box large
language model's prompt with task demonstrations from previous runs. To
maximize end-to-end success rates, we also propose an intelligent backtracking
mechanism that learns and recovers from its mistakes. Finally, we show that our
ranking model can be trained on data from a generative auto-curriculum which
samples representative goals from an LLM, runs the agent, and automatically
evaluates it, with no manual annotation. Wilbur achieves state-of-the-art
results on the WebVoyager benchmark, beating text-only models by 8% overall,
and up to 36% on certain websites. On the same benchmark, Wilbur is within 5%
of a strong multi-modal model despite only receiving textual inputs, and
further analysis reveals that a substantial number of failures are due to
engineering challenges of operating the web.
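Although the abstract gives only a high-level description, the core loop it outlines (retrieve the most relevant past demonstrations with a ranking model, use them to populate the black-box LLM's prompt, and backtrack when a step fails) can be sketched roughly as follows. All names here (`Demo`, `rank_demos`, `build_prompt`, `run_episode`, and the `act`/`observe`/`done` callbacks) are illustrative assumptions, not Wilbur's actual API, and the word-overlap scorer merely stands in for the learned differentiable ranking model.

```python
# Hypothetical sketch of a Wilbur-style retrieve-and-backtrack agent loop.
# Nothing here is taken from the paper's code; it only mirrors the abstract.
from dataclasses import dataclass


@dataclass
class Demo:
    goal: str
    steps: list[str]   # a successful action trajectory from a previous run


def rank_demos(demos: list[Demo], goal: str, k: int = 3) -> list[Demo]:
    """Stand-in for the learned ranking model: keep the k most relevant demos."""
    def overlap(d: Demo) -> int:
        return len(set(d.goal.lower().split()) & set(goal.lower().split()))
    return sorted(demos, key=overlap, reverse=True)[:k]


def build_prompt(goal: str, page: str, demos: list[Demo], history: list[str]) -> str:
    """Populate the LLM prompt with retrieved task demonstrations."""
    shown = "\n".join(f"Goal: {d.goal}\nSteps: {d.steps}" for d in demos)
    return f"{shown}\n\nCurrent goal: {goal}\nPage: {page}\nHistory: {history}\nNext action:"


def run_episode(goal: str, demos: list[Demo], act, observe, done, max_steps: int = 20) -> bool:
    """act/observe/done are callbacks standing in for browser control and goal checking."""
    history: list[str] = []
    checkpoints: list[list[str]] = []   # saved histories to backtrack to
    for _ in range(max_steps):
        page = observe()
        prompt = build_prompt(goal, page, rank_demos(demos, goal), history)
        action = act(prompt)             # query the LLM and execute its action
        if action is None:               # step failed
            if checkpoints:              # backtrack to the last good state
                history = checkpoints.pop()
            continue
        checkpoints.append(list(history))
        history.append(action)
        if done(goal, observe()):
            return True
    return False
```

In this reading, the auto-curriculum described in the abstract would populate the `demos` pool offline by sampling goals from an LLM, running the agent, and keeping automatically verified successes; that part is omitted from the sketch.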