MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

Abstract

Video generation models aspire to simulate dynamic environments, and several benchmarks now evaluate memory consistency across frames. However, most assess consistency only while the target remains in view, and the few that force objects out of view evaluate only static scenes in which nothing changes during occlusion. To bridge this gap, we introduce MemoBench, a diagnostic benchmark built around the disappear-and-reappear paradigm in dynamically changing environments: a target object undergoes a physical process, disappears from view, and must be correctly recovered in its updated state upon reappearance. We curate 360 ground-truth clips spanning synthetic and real-world scenes, and design an evaluation suite that combines automated metrics with VQA-based assessment across four diagnostic pillars. Evaluation of eight state-of-the-art models reveals key insights and open challenges regarding memory consistency under the disappear-and-reappear paradigm.