Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos

Abstract

Multimodal AI agents are increasingly automating complex real-world workflows that involve online web execution. However, current web-agent benchmarks share a critical limitation: they focus entirely on web-based interaction and perception and lack grounding in the user's physical surroundings. As a result, they cannot evaluate crucial scenarios in which an agent must use egocentric visual perception (e.g., via AR glasses) to recognize an object in the user's surroundings and then complete a related task online. To address this gap, we introduce Ego2Web, the first benchmark designed to bridge egocentric video perception and web agent execution. Ego2Web pairs real-world first-person video recordings with web tasks whose successful completion requires visual understanding, web task planning, and interaction in an online environment. We combine an automatic data-generation pipeline with human verification and refinement to curate high-quality video-task pairs across diverse web task types, including e-commerce, media retrieval, and knowledge lookup. To enable accurate and scalable evaluation, we also develop Ego2WebJudge, a novel LLM-as-a-Judge automatic evaluation method that achieves approximately 84% agreement with human judgment, substantially higher than existing evaluation methods. Experiments with diverse state-of-the-art (SoTA) agents on Ego2Web show that their performance is weak, with substantial headroom across all task categories. We also conduct a comprehensive ablation study on task design, which highlights both the necessity of accurate video understanding for the proposed tasks and the limitations of current agents. We hope Ego2Web will serve as a critical new resource for developing truly capable AI assistants that can seamlessly see, understand, and act across the physical and digital worlds.
