GeoWorld: Geometric World Models

Abstract

Energy-based predictive world models offer a powerful approach to multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, which neglects the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, so performance degrades rapidly over extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive experiments on CrossTask and COIN demonstrate an improvement in success rate (SR) of around 3% in 3-step planning and around 2% in 4-step planning over the state-of-the-art V-JEPA 2. Project website: https://steve-zeyu-zhang.github.io/GeoWorld.
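To make the idea of mapping Euclidean latents onto a hyperbolic manifold concrete, the following is a minimal sketch, not the GeoWorld implementation. It assumes the Poincaré ball model with curvature -c, which the abstract does not specify, and the function names `expmap0` and `poincare_distance` are hypothetical helpers chosen for illustration. It uses the standard exponential map at the origin and the standard closed-form geodesic distance.

```python
import math


def expmap0(v, c=1.0):
    """Map a Euclidean (tangent) vector v into the Poincare ball of curvature -c.

    Assumption: the Poincare ball is used only for illustration; the
    abstract does not say which hyperbolic model GeoWorld adopts.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    sqrt_c = math.sqrt(c)
    # tanh squashes any norm into [0, 1), so the result stays inside the ball.
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points strictly inside the Poincare ball."""
    def sq_norm(u):
        return sum(a * a for a in u)

    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq_norm(x)) * (1.0 - c * sq_norm(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)


# Hypothetical 2-D latent vectors standing in for encoder outputs.
z_a = expmap0([0.3, 0.4])
z_b = expmap0([-1.0, 2.0])
d_ab = poincare_distance(z_a, z_b)
```

Distances in this space grow rapidly toward the boundary of the ball, which is why hyperbolic embeddings are often used for tree-like or hierarchical data. Very large input norms push points numerically onto the boundary, so a practical implementation would clip them; this sketch omits that step.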
