Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models

Abstract

Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni-WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni-WorldBench comprises two key components: Omni-WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni-Metrics, an agent-based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response and provides actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.
