Aether: Geometric-Aware Unified World Modeling

Abstract

The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, Aether shares knowledge synergistically across the reconstruction, prediction, and planning objectives. Built upon video generation models, our framework demonstrates unprecedented synthetic-to-real generalization despite never observing real-world data during training. Furthermore, thanks to its intrinsic geometric modeling, our approach achieves zero-shot generalization in both action following and reconstruction. Remarkably, even without real-world data, its reconstruction performance far exceeds that of domain-specific models. Additionally, Aether leverages a geometry-informed action space to translate predictions directly into actions, enabling effective autonomous trajectory planning. We hope our work inspires the community to explore new frontiers in physically plausible world modeling and its applications.
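The abstract does not say how the three objectives are interleaved during training. As a minimal sketch, assuming a single shared model that is told which task it is solving through a conditioning mask, and assuming one task is sampled per training step, the idea could look like the following. The names `ConditionMask`, `mask_for_task`, and `interleaved_schedule`, and the specific mask layouts, are hypothetical illustrations rather than Aether's actual implementation.

```python
import random
from dataclasses import dataclass

# Hypothetical task labels; the abstract does not specify how tasks are encoded.
TASKS = ("reconstruction", "prediction", "planning")


@dataclass(frozen=True)
class ConditionMask:
    """Assumed description of which inputs the shared model sees for one sample."""
    observed_frames: str  # "all", "first", or "first_and_last"
    action_given: bool    # whether an action sequence is provided as a condition


def mask_for_task(task: str) -> ConditionMask:
    """Map a task to an (assumed) conditioning mask for the shared model."""
    if task == "reconstruction":
        # 4D reconstruction: the whole video is observed; geometry is the target.
        return ConditionMask(observed_frames="all", action_given=False)
    if task == "prediction":
        # Action-conditioned prediction: first frame plus an action sequence.
        return ConditionMask(observed_frames="first", action_given=True)
    if task == "planning":
        # Goal-conditioned planning: start and goal frames; actions are predicted.
        return ConditionMask(observed_frames="first_and_last", action_given=False)
    raise ValueError(f"unknown task: {task!r}")


def interleaved_schedule(num_steps: int, seed: int = 0) -> list:
    """Sample one task per training step so all objectives update shared weights."""
    rng = random.Random(seed)
    return [rng.choice(TASKS) for _ in range(num_steps)]
```

In a training loop one would, under these assumptions, call `mask_for_task(task)` for each entry of `interleaved_schedule(n)` and pass the mask to the same network, so that reconstruction, prediction, and planning gradients all flow into one set of parameters.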
