

When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation

April 1, 2026
作者: Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou, Hanrong Zhang, Yaozu Wu, Liancheng Fang, Zhengyao Gu, Zhen Zhang, Kening Zheng, Fangxin Wang, Yi Nian, Shanghao Li, Wenzhe Fan, Langzhou He, Weizhi Zhang, Xue Liu, Philip S. Yu
cs.AI

Abstract
As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirements or revising goals, during mid-task execution is becoming a core requirement for realistic deployment. However, existing benchmarks largely assume uninterrupted agent behavior or study interruptions only in short, unconstrained language tasks. In this paper, we present the first systematic study of interruptible agents in long-horizon, environmentally grounded web navigation tasks, where actions induce persistent state changes. We formalize three realistic interruption types, including addition, revision, and retraction, and introduce InterruptBench, a benchmark derived from WebArena-Lite that synthesizes high-quality interruption scenarios under strict semantic constraints. Using a unified interruption simulation framework, we evaluate six strong LLM backbones across single- and multi-turn interruption settings, analyzing both their effectiveness in adapting to updated intents and their efficiency in recovering from mid-task changes. Our results show that handling user interruptions effectively and efficiently during long-horizon agentic tasks remains challenging even for powerful large-scale LLMs. Code and dataset are available at https://github.com/HenryPengZou/InterruptBench.
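To make the abstract's setup concrete, the sketch below mocks up the three interruption types (addition, revision, retraction) and a loop that injects a single user interruption into an agent's step-by-step execution. All names here (`InterruptionType`, `Episode`, `run_with_interruption`, the toy agent) are illustrative assumptions, not the actual InterruptBench API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Tuple

# Hypothetical sketch of the interruption taxonomy described in the abstract;
# these names are NOT from the InterruptBench codebase.

class InterruptionType(Enum):
    ADDITION = "addition"      # user appends a new requirement mid-task
    REVISION = "revision"      # user changes part of the original goal
    RETRACTION = "retraction"  # user withdraws a stated requirement

@dataclass
class Episode:
    goal: str
    steps_taken: List[str] = field(default_factory=list)
    interruptions: List[Tuple[int, InterruptionType, str]] = field(default_factory=list)

def run_with_interruption(episode: Episode,
                          agent_step: Callable[[Episode], str],
                          interrupt_at: int,
                          itype: InterruptionType,
                          message: str,
                          max_steps: int = 20) -> Episode:
    """Run an agent loop, injecting one user interruption at a fixed step."""
    for t in range(max_steps):
        if t == interrupt_at:
            # The agent must now adapt its plan to the updated intent.
            episode.interruptions.append((t, itype, message))
        action = agent_step(episode)
        episode.steps_taken.append(action)
        if action == "STOP":
            break
    return episode

# Toy agent: keeps clicking, then stops two steps after an interruption arrives.
def toy_agent(ep: Episode) -> str:
    if ep.interruptions and len(ep.steps_taken) >= ep.interruptions[0][0] + 2:
        return "STOP"
    return "click"

result = run_with_interruption(
    Episode(goal="order a red mug"),
    toy_agent,
    interrupt_at=3,
    itype=InterruptionType.REVISION,
    message="Actually, make it a blue mug.",
)
print(len(result.interruptions), result.steps_taken[-1])  # → 1 STOP
```

In the paper's terms, effectiveness would correspond to whether the final state satisfies the updated intent, and efficiency to how many extra steps the agent spends recovering after the interruption; the toy agent above only illustrates the injection mechanics.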
April 3, 2026