A Benchmark for Interactive World Models with a Unified Action Generation Framework

Abstract

Achieving Artificial General Intelligence (AGI) requires agents that learn and interact adaptively, and interactive world models provide scalable environments for perception, reasoning, and action. Yet current research still lacks large-scale datasets and unified benchmarks to evaluate their physical interaction capabilities. To address this, we propose iWorld-Bench, a comprehensive benchmark for training and testing world models on interaction-related abilities such as distance perception and memory. We construct a diverse dataset of 330k video clips and select from it 2.1k high-quality samples covering varied perspectives, weather conditions, and scenes. Because existing world models differ in their interaction modalities, we introduce an Action Generation Framework to unify evaluation, and we design six task types that yield 4.9k test samples. Together, these tasks assess model performance in visual generation, trajectory following, and memory. Evaluating 14 representative world models, we identify key limitations and provide insights for future research. The iWorld-Bench model leaderboard is publicly available at iWorld-Bench.com.
