OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
April 20, 2026
作者: Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li, Guang Li, Lingdong Kong, Yingyan Li, Han Wang, Shaoqing Xu, Yuechen Luo, Fang Li, Chenxu Dang, Junli Wang, Tao Xu, Jing Wu, Jianhua Wu, Xiaoshuai Hao, Wen Zhang, Tianyi Jiang, Lingfeng Zhang, Lei Zhou, Yingbo Tang, Jie Wang, Yinfeng Gao, Xizhou Bu, Haochen Tian, Yihang Qiu, Feiyang Jia, Lin Liu, Yigu Ge, Hanbing Li, Yuannan Shen, Jianwei Cui, Hongwei Xie, Bing Wang, Haiyang Sun, Jingwei Zhao, Jiahui Huang, Pei Liu, Zeyu Zhu, Yuncheng Jiang, Zibin Guo, Chuhong Gong, Hanchao Leng, Kun Ma, Naiyang Wang, Guang Chen, Kuiyuan Yang, Hangjun Ye, Long Chen
cs.AI
Abstract
Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but they consistently fall short of their explicit counterparts. We argue that this is because purely linguistic latent representations compress a symbolic abstraction of the world rather than the causal dynamics that actually govern driving. We therefore present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs the text CoT, we introduce a visual world-model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL is the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency and providing direct evidence that tighter compression, when guided by both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL
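
To make the dual-decoder design concrete, here is a minimal PyTorch sketch of the training-time supervision: fused vision-language features are compressed into a few latent tokens, and three heads supervise them following a staged schedule (trajectory, then +language, then +visual). All module names, dimensions, stage weights, and the linear stand-ins for what are presumably full autoregressive decoders are illustrative assumptions, not the released OneVL implementation.

```python
# Illustrative sketch only; not the released OneVL code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneVLSketch(nn.Module):
    """Routes reasoning through K compact latent tokens that are
    supervised by two auxiliary decoders during training only."""
    def __init__(self, d_model=512, num_latents=8, text_vocab=32000,
                 frame_vocab=8192, num_waypoints=10):
        super().__init__()
        self.latent_queries = nn.Parameter(torch.randn(num_latents, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8,
                                                batch_first=True)
        # Auxiliary decoders: used only for training supervision.
        self.lang_decoder = nn.Linear(d_model, text_vocab)    # reconstructs text CoT
        self.world_decoder = nn.Linear(d_model, frame_vocab)  # predicts future-frame tokens
        # Task head kept at inference time.
        self.traj_head = nn.Linear(d_model, num_waypoints * 2)

    def forward(self, context):
        # context: (B, T, d_model) fused vision-language features.
        q = self.latent_queries.unsqueeze(0).expand(context.size(0), -1, -1)
        latents, _ = self.cross_attn(q, context, context)     # (B, K, d_model)
        return latents

def training_loss(model, context, cot_ids, frame_ids, waypoints, stage):
    """Three-stage schedule: each stage adds an objective (the weights are
    assumed; the abstract only states that alignment is progressive)."""
    latents = model(context)
    l_traj = F.mse_loss(model.traj_head(latents.mean(dim=1)), waypoints)
    # One latent supervises one aligned target token in this sketch.
    l_lang = F.cross_entropy(model.lang_decoder(latents).transpose(1, 2), cot_ids)
    l_world = F.cross_entropy(model.world_decoder(latents).transpose(1, 2), frame_ids)
    w_traj, w_lang, w_world = {1: (1.0, 0.0, 0.0),
                               2: (1.0, 1.0, 0.0),
                               3: (1.0, 1.0, 1.0)}[stage]
    return w_traj * l_traj + w_lang * l_lang + w_world * l_world
```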
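Inference then discards the auxiliary decoders entirely: the latent tokens come from one parallel forward pass rather than token-by-token generation, which is why latency matches answer-only prediction. A minimal sketch, assuming the hypothetical `OneVLSketch` module above:

```python
@torch.no_grad()
def infer_trajectory(model, context):
    # Single parallel pass over the latent tokens; no autoregressive CoT,
    # and lang_decoder / world_decoder are never called.
    latents = model(context)
    return model.traj_head(latents.mean(dim=1))

# Usage with stand-in features:
# model = OneVLSketch().eval()
# context = torch.randn(2, 196, 512)        # (B, T, d_model)
# traj = infer_trajectory(model, context)   # (B, 20) flattened waypoints
```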