Aether: Geometric-Aware Unified World Modeling

Abstract

The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, Aether shares knowledge synergistically across its reconstruction, prediction, and planning objectives. Built upon video generation models, our framework demonstrates unprecedented synthetic-to-real generalization despite never observing real-world data during training. Furthermore, thanks to its intrinsic geometric modeling, our approach achieves zero-shot generalization in both action-following and reconstruction tasks. Remarkably, even without real-world data, its reconstruction performance far exceeds that of domain-specific models. Additionally, Aether leverages a geometry-informed action space to translate predictions seamlessly into actions, enabling effective autonomous trajectory planning. We hope our work inspires the community to explore new frontiers in physically reasonable world modeling and its applications.
