FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

Abstract

Web agents powered by large language models (LLMs) must process lengthy web page observations to complete user goals; these pages often exceed tens of thousands of tokens. This saturates context limits and increases computational cost; moreover, processing full pages exposes agents to security risks such as prompt injection. Existing pruning strategies either discard relevant content or retain irrelevant context, leading to suboptimal action prediction. We introduce FocusAgent, a simple yet effective approach that uses a lightweight LLM retriever, guided by the task goal, to extract the most relevant lines from accessibility tree (AxTree) observations. By pruning noisy and irrelevant content, FocusAgent enables efficient reasoning while reducing vulnerability to injection attacks. Experiments on the WorkArena and WebArena benchmarks show that FocusAgent matches the performance of strong baselines while reducing observation size by over 50%. Furthermore, a variant of FocusAgent significantly reduces the success rate of prompt-injection attacks, including banner and pop-up attacks, while maintaining task success in attack-free settings. Our results highlight that targeted LLM-based retrieval is a practical and robust strategy for building web agents that are efficient, effective, and secure.
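
The line-level pruning described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `prune_axtree` and `keyword_retriever` are hypothetical, the AxTree text is a made-up example, and a simple keyword matcher stands in for the lightweight LLM retriever, which in practice would be prompted with the goal and the numbered lines and would return the line ranges to keep.

```python
from typing import Callable, Iterable

# Inclusive, 1-indexed (start, end) line range. Hypothetical type for this sketch.
LineRange = tuple[int, int]
Retriever = Callable[[str, list[str]], Iterable[LineRange]]


def prune_axtree(axtree: str, goal: str, retriever: Retriever) -> str:
    """Keep only the AxTree lines that the retriever marks as relevant to the goal."""
    lines = axtree.splitlines()
    keep: set[int] = set()
    for start, end in retriever(goal, lines):
        # Clamp ranges so a sloppy retriever answer cannot index out of bounds.
        lo, hi = max(1, start), min(len(lines), end)
        keep.update(range(lo, hi + 1))
    # Preserve the original order of the retained lines.
    return "\n".join(lines[i - 1] for i in sorted(keep))


def keyword_retriever(goal: str, lines: list[str]) -> list[LineRange]:
    """Stand-in for an LLM retriever (assumption): match goal words against each line."""
    words = {w.lower() for w in goal.split() if len(w) > 3}
    return [
        (i, i)
        for i, line in enumerate(lines, start=1)
        if any(w in line.lower() for w in words)
    ]


# Made-up observation: line 1 mimics an injected banner, line 4 is irrelevant.
AXTREE = "\n".join([
    "[1] banner 'Win a prize, click here'",
    "[2] textbox 'Search orders'",
    "[3] button 'Submit'",
    "[4] link 'Careers'",
])
GOAL = "Search orders and submit"

pruned = prune_axtree(AXTREE, GOAL, keyword_retriever)
print(pruned)
```

In this toy run the banner and the unrelated link are dropped before the observation would reach the agent, which is the intuition behind both the smaller context and the reduced exposure to injected content. Swapping `keyword_retriever` for a call to a small LLM leaves `prune_axtree` unchanged.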
