BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions
October 12, 2025
Authors: Zhengbo Zhang, Zhiheng Lyu, Junhao Gong, Hongzhu Yi, Xinming Wang, Yuxuan Zhou, Jiabing Yang, Ping Nie, Yan Huang, Wenhu Chen
cs.AI
Abstract
Efficiently solving real-world problems with LLMs increasingly hinges on
their ability to interact with dynamic web environments and autonomously
acquire external information. While recent systems such as Search-R1 and
WebDancer demonstrate strong performance on web tasks, they rely heavily
on additional tools to convert the interactive web environment into static
text content. This is in contrast to human browsing behaviors, which involve
diverse interactions with the browser, such as scrolling, clicking, and typing.
In this paper, we propose BrowserAgent, a more interactive agent that solves
complex tasks through human-inspired browser actions. BrowserAgent operates
directly on raw web pages via Playwright through a set of predefined browser
actions. We adopt a two-stage training scheme, Supervised Fine-Tuning (SFT) and
Rejection Fine-Tuning (RFT), to improve the model's generalization ability.
Despite using significantly less training data than Search-R1, BrowserAgent
achieves more competitive results across different Open-QA tasks. Additionally,
we introduce an explicit memory mechanism to store key conclusions across
steps, further enhancing the model's reasoning capabilities for long-horizon
tasks. Notably, BrowserAgent-7B achieves around a 20% improvement over
Search-R1 on multi-hop QA tasks such as HotpotQA, 2Wiki, and Bamboogle. These
results indicate that BrowserAgent can serve as a more advanced framework for
building more interactive and scalable web agents.
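
To make the idea of predefined browser actions plus an explicit memory more concrete, here is a minimal sketch of how an action dispatcher might look. It is an illustration under stated assumptions, not the authors' implementation: the action names (`goto`, `click`, `type`, `scroll`, `remember`, `stop`), the dictionary format, and the `AgentState` class are all hypothetical. The only Playwright calls assumed on the `page` object are `goto`, `click`, `fill`, `mouse.wheel`, and `keyboard.press`, which exist in Playwright's Python sync API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class AgentState:
    """Hypothetical per-episode state holding the explicit memory."""
    memory: List[str] = field(default_factory=list)  # key conclusions kept across steps
    answer: Optional[str] = None                     # set once the agent stops


def execute_action(page: Any, state: AgentState, action: Dict[str, Any]) -> None:
    """Apply one action from a small, assumed action set.

    `page` is expected to behave like a Playwright sync `Page`.
    The action schema below is illustrative, not taken from the paper.
    """
    kind = action["type"]
    if kind == "goto":
        page.goto(action["url"])
    elif kind == "click":
        page.click(action["selector"])
    elif kind == "type":
        page.fill(action["selector"], action["text"])
        if action.get("submit"):
            page.keyboard.press("Enter")
    elif kind == "scroll":
        # Positive delta_y scrolls down, negative scrolls up.
        page.mouse.wheel(0, action["delta_y"])
    elif kind == "remember":
        # Explicit memory: store a conclusion without touching the browser.
        state.memory.append(action["note"])
    elif kind == "stop":
        state.answer = action["answer"]
    else:
        raise ValueError(f"unknown action type: {kind!r}")
```

With a real browser, `page` would typically come from `sync_playwright()` via `browser.new_page()`; because the dispatcher only relies on the methods listed above, it can also be exercised with a stub object. Keeping `remember` as an ordinary action means the model decides what to store, which is one plausible way to realize a memory of key conclusions across steps.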