BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions
October 12, 2025
Authors: Zhengbo Zhang, Zhiheng Lyu, Junhao Gong, Hongzhu Yi, Xinming Wang, Yuxuan Zhou, Jiabing Yang, Ping Nie, Yan Huang, Wenhu Chen
cs.AI
Abstract
Efficiently solving real-world problems with LLMs increasingly hinges on
their ability to interact with dynamic web environments and autonomously
acquire external information. Although recent systems such as Search-R1 and
WebDancer perform strongly on web tasks, they rely heavily on additional tools
that convert the interactive web environment into static text content. Human
browsing, by contrast, involves diverse interactions with the browser, such as
scrolling, clicking, and typing.
In this paper, we propose BrowserAgent, a more interactive agent that solves
complex tasks through human-inspired browser actions. BrowserAgent operates
directly on raw web pages via Playwright through a set of predefined browser
actions. We adopt a two-stage training procedure, Supervised Fine-Tuning (SFT)
and Rejection Fine-Tuning (RFT), to improve the model's generalization ability.
Despite using significantly less training data than Search-R1, BrowserAgent
achieves more competitive results across different Open-QA tasks. Additionally,
we introduce an explicit memory mechanism to store key conclusions across
steps, further enhancing the model's reasoning capabilities for long-horizon
tasks. Notably, BrowserAgent-7B achieves around a 20% improvement over
Search-R1 on multi-hop QA tasks such as HotpotQA, 2Wiki, and Bamboogle. These
results indicate that BrowserAgent can serve as a more advanced framework for
more interactive and scalable web agents.
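
To make the idea of "predefined browser actions" plus an explicit memory more concrete, here is a minimal sketch, not the authors' implementation. The action names (`goto`, `click`, `type`, `scroll`, `note`, `stop`), the `AgentState` class, and the `execute` helper are all hypothetical. The `page` argument is assumed to expose the methods of Playwright's synchronous `Page` object (`goto`, `click`, `fill`, `mouse.wheel`).

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class AgentState:
    """Hypothetical explicit memory: key conclusions kept across steps."""
    memory: List[str] = field(default_factory=list)
    answer: Optional[str] = None


# Each handler maps one predefined action onto a browser call.
# `page` is assumed to behave like a Playwright sync-API Page.
def do_goto(page: Any, state: AgentState, url: str) -> None:
    page.goto(url)


def do_click(page: Any, state: AgentState, selector: str) -> None:
    page.click(selector)


def do_type(page: Any, state: AgentState, selector: str, text: str) -> None:
    page.fill(selector, text)


def do_scroll(page: Any, state: AgentState, delta_y: float) -> None:
    # Scroll vertically by delta_y pixels; no horizontal movement.
    page.mouse.wheel(0, delta_y)


def do_note(page: Any, state: AgentState, text: str) -> None:
    # Does not touch the browser: only records a conclusion in memory.
    state.memory.append(text)


def do_stop(page: Any, state: AgentState, answer: str) -> None:
    state.answer = answer


# Hypothetical action vocabulary; the paper's actual set may differ.
ACTIONS: Dict[str, Callable[..., None]] = {
    "goto": do_goto,
    "click": do_click,
    "type": do_type,
    "scroll": do_scroll,
    "note": do_note,
    "stop": do_stop,
}


def execute(page: Any, state: AgentState, action: Dict[str, Any]) -> AgentState:
    """Dispatch one model-chosen action, e.g. {"name": "click", "args": ["#go"]}."""
    handler = ACTIONS.get(action["name"])
    if handler is None:
        raise ValueError(f"unknown action: {action['name']}")
    handler(page, state, *action.get("args", []))
    return state
```

With Playwright installed, a real page could be obtained from `playwright.sync_api.sync_playwright()` by launching a browser and calling `new_page()`, then passed to `execute` unchanged. Keeping memory in a separate state object means conclusions persist even after the page content they came from has been scrolled or navigated away.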