WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents

April 8, 2024
Authors: Michael Lutz, Arth Bohra, Manvel Saroyan, Artem Harutyunyan, Giovanni Campagna
cs.AI

Abstract

In the realm of web agent research, achieving both generalization and accuracy remains a challenging problem. Due to high variance in website structure, existing approaches often fail. Moreover, existing fine-tuning and in-context learning techniques fail to generalize across multiple websites. We introduce Wilbur, an approach that uses a differentiable ranking model and a novel instruction synthesis technique to optimally populate a black-box large language model's prompt with task demonstrations from previous runs. To maximize end-to-end success rates, we also propose an intelligent backtracking mechanism that learns and recovers from its mistakes. Finally, we show that our ranking model can be trained on data from a generative auto-curriculum which samples representative goals from an LLM, runs the agent, and automatically evaluates it, with no manual annotation. Wilbur achieves state-of-the-art results on the WebVoyager benchmark, beating text-only models by 8% overall, and up to 36% on certain websites. On the same benchmark, Wilbur is within 5% of a strong multi-modal model despite only receiving textual inputs, and further analysis reveals a substantial number of failures are due to engineering challenges of operating the web.
