

Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos

March 23, 2026
Authors: Shoubin Yu, Lei Shu, Antoine Yang, Yao Fu, Srinivas Sunkara, Maria Wang, Jindong Chen, Mohit Bansal, Boqing Gong
cs.AI

Abstract

Multimodal AI agents are increasingly automating complex real-world workflows that involve online web execution. However, current web-agent benchmarks suffer from a critical limitation: they focus entirely on web-based interaction and perception, lacking grounding in the user's real-world physical surroundings. This limitation prevents evaluation in crucial scenarios, such as when an agent must use egocentric visual perception (e.g., via AR glasses) to recognize an object in the user's surroundings and then complete a related task online. To address this gap, we introduce Ego2Web, the first benchmark designed to bridge egocentric video perception and web-agent execution. Ego2Web pairs real-world first-person video recordings with web tasks that require visual understanding, web task planning, and interaction in an online environment for successful completion. We use an automatic data-generation pipeline combined with human verification and refinement to curate high-quality video-task pairs across diverse web task types, including e-commerce, media retrieval, and knowledge lookup. To enable accurate and scalable evaluation, we also develop a novel LLM-as-a-Judge automatic evaluation method, Ego2WebJudge, which achieves approximately 84% agreement with human judgment, substantially higher than existing evaluation methods. Experiments with diverse state-of-the-art agents on Ego2Web show weak performance, with substantial headroom across all task categories. We also conduct a comprehensive ablation study on task design, highlighting the necessity of accurate video understanding for the proposed tasks and the limitations of current agents. We hope Ego2Web can serve as a critical new resource for developing truly capable AI assistants that can seamlessly see, understand, and act across the physical and digital worlds.
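The abstract reports that Ego2WebJudge reaches roughly 84% agreement with human judgment but does not detail the protocol. A minimal, purely illustrative sketch of how an LLM-as-a-Judge verdict and its human-agreement rate might be computed follows; `judge_success` is a hypothetical stand-in (here a trivial heuristic so the sketch runs without an API call), not the paper's actual Ego2WebJudge implementation.

```python
# Hypothetical sketch of an LLM-as-a-Judge evaluation loop.
# All names and the heuristic are illustrative assumptions, not the paper's method.

def judge_success(task: str, trajectory: list[str]) -> bool:
    """Stand-in for an LLM call that reads the task description and the agent's
    action trajectory, then returns a binary success verdict. In a real judge,
    this would prompt a model; here it is a toy heuristic for runnability."""
    if "buy" in task:
        return any("checkout" in step for step in trajectory)
    return bool(trajectory)

def agreement_rate(verdicts: list[bool], human_labels: list[bool]) -> float:
    """Fraction of episodes where the automatic judge matches the human rater."""
    matches = sum(v == h for v, h in zip(verdicts, human_labels))
    return matches / len(human_labels)

# Toy episodes: (web task grounded in an egocentric video, agent action trajectory).
episodes = [
    ("buy the mug seen in the video", ["search mug", "add to cart", "checkout"]),
    ("buy the lamp seen in the video", ["search lamp"]),
    ("look up the plant species seen in the video", ["search plant", "open wiki page"]),
]
human_labels = [True, False, True]

verdicts = [judge_success(task, traj) for task, traj in episodes]
print(agreement_rate(verdicts, human_labels))  # → 1.0 on this toy data
```

In practice the judge's verdicts would come from model outputs over held-out episodes with human success labels, and the reported ~84% figure would be exactly this agreement rate over the benchmark's annotated set.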