HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation

April 30, 2026
Authors: Xin Zhou, Dingkang Liang, Xiwu Chen, Feiyang Tan, Dingyuan Zhang, Hengshuang Zhao, Xiang Bai
cs.AI

Abstract

Driving world models serve as a pivotal technology for autonomous driving by simulating environmental dynamics. However, existing approaches predominantly focus on future scene generation, often overlooking comprehensive 3D scene understanding. Conversely, while Large Language Models (LLMs) demonstrate impressive reasoning capabilities, they lack the capacity to predict future geometric evolution, creating a significant disparity between semantic interpretation and physical simulation. To bridge this gap, we propose HERMES++, a unified driving world model that integrates 3D scene understanding and future geometry prediction within a single framework. Our approach addresses the distinct requirements of these tasks through synergistic designs. First, a BEV representation consolidates multi-view spatial information into a structure compatible with LLMs. Second, we introduce LLM-enhanced world queries to facilitate knowledge transfer from the understanding branch. Third, a Current-to-Future Link is designed to bridge the temporal gap, conditioning geometric evolution on semantic context. Finally, to enforce structural integrity, we employ a Joint Geometric Optimization strategy that integrates explicit geometric constraints with implicit latent regularization to align internal representations with geometry-aware priors. Extensive evaluations on multiple benchmarks validate the effectiveness of our method. HERMES++ achieves strong performance, outperforming specialist approaches in both future point cloud prediction and 3D scene understanding tasks. The model and code will be publicly released at https://github.com/H-EmbodVis/HERMESV2.