When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation

Abstract

As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions during mid-task execution, such as adding requirements or revising goals, is becoming a core requirement for realistic deployment. However, existing benchmarks largely assume uninterrupted agent behavior or study interruptions only in short, unconstrained language tasks. In this paper, we present the first systematic study of interruptible agents in long-horizon, environmentally grounded web navigation tasks, where actions induce persistent state changes. We formalize three realistic interruption types (addition, revision, and retraction) and introduce InterruptBench, a benchmark derived from WebArena-Lite that synthesizes high-quality interruption scenarios under strict semantic constraints. Using a unified interruption simulation framework, we evaluate six strong LLM backbones across single- and multi-turn interruption settings, analyzing both their effectiveness in adapting to updated intents and their efficiency in recovering from mid-task changes. Our results show that handling user interruptions effectively and efficiently during long-horizon agentic tasks remains challenging even for powerful large-scale LLMs. Code and dataset are available at https://github.com/HenryPengZou/InterruptBench.
