FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

Abstract

Web agents powered by large language models (LLMs) must process lengthy web page observations to complete user goals; these pages often exceed tens of thousands of tokens. This saturates context limits and increases computational cost; moreover, processing full pages exposes agents to security risks such as prompt injection. Existing pruning strategies either discard relevant content or retain irrelevant context, leading to suboptimal action prediction. We introduce FocusAgent, a simple yet effective approach that leverages a lightweight LLM retriever to extract the most relevant lines from accessibility tree (AxTree) observations, guided by task goals. By pruning noisy and irrelevant content, FocusAgent enables efficient reasoning while reducing vulnerability to injection attacks. Experiments on the WorkArena and WebArena benchmarks show that FocusAgent matches the performance of strong baselines while reducing observation size by over 50%. Furthermore, a variant of FocusAgent significantly reduces the success rate of prompt-injection attacks, including banner and pop-up attacks, while maintaining task success performance in attack-free settings. Our results highlight that targeted LLM-based retrieval is a practical and robust strategy for building web agents that are efficient, effective, and secure.
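
As a rough illustration of line-level pruning of an AxTree observation, the following minimal Python sketch may help. It is not the FocusAgent implementation: the function names, the `...` placeholder, and the sample AxTree format are all hypothetical, and a trivial keyword-overlap function stands in for the lightweight LLM retriever that the abstract describes.

```python
from typing import Callable, List


def keyword_retriever(goal: str, lines: List[str]) -> List[int]:
    """Hypothetical stand-in for an LLM retriever.

    Returns the indices of lines that share at least one word with the goal.
    A real retriever would prompt a lightweight LLM instead.
    """
    goal_words = {w.lower() for w in goal.split()}
    keep = []
    for i, line in enumerate(lines):
        # Strip simple punctuation so that 'Submit matches submit.
        words = {w.strip("'\"[](),.:").lower() for w in line.split()}
        if goal_words & words:
            keep.append(i)
    return keep


def prune_observation(
    goal: str,
    axtree: str,
    retriever: Callable[[str, List[str]], List[int]] = keyword_retriever,
    placeholder: str = "...",
) -> str:
    """Keep only the lines selected by the retriever.

    Each run of dropped lines is collapsed into a single placeholder line
    (an assumption made here for readability, not a documented detail).
    """
    lines = axtree.splitlines()
    keep = set(retriever(goal, lines))
    pruned: List[str] = []
    for i, line in enumerate(lines):
        if i in keep:
            pruned.append(line)
        elif not pruned or pruned[-1] != placeholder:
            pruned.append(placeholder)
    return "\n".join(pruned)


# Toy AxTree in an assumed "[id] role 'name'" format.
SAMPLE_AXTREE = "\n".join([
    "[1] banner 'Free prizes, click here'",
    "[2] link 'Home'",
    "[3] textbox 'Search'",
    "[4] button 'Submit order'",
])

if __name__ == "__main__":
    print(prune_observation("Submit the order", SAMPLE_AXTREE))
```

In this toy case the banner line, which could carry injected text, is dropped together with the other lines unrelated to the goal, so the agent would see only the placeholder and the `Submit order` button.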
