OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Abstract

Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in Vision-Language-Action (VLA) based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but they consistently fall short of their explicit counterparts. We suggest this is because purely linguistic latent representations compress a symbolic abstraction of the world rather than the causal dynamics that actually govern driving. We therefore present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs the text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency and providing direct evidence that tighter compression, when guided by both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL
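
The latency argument above can be illustrated with a rough step-count model. This is only a sketch under stated assumptions, not the authors' implementation: it assumes one forward pass for the prefill and one further pass per generated token, and the names `DecodeBudget` and `sequential_passes`, along with all token counts, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeBudget:
    """Hypothetical token counts for one driving query (illustrative values only)."""
    reasoning_tokens: int  # length of an explicit text CoT
    latent_tokens: int     # number of compact latent tokens
    answer_tokens: int     # tokens needed to emit the trajectory


def sequential_passes(budget: DecodeBudget, mode: str) -> int:
    """Count sequential forward passes: one prefill plus one pass per generated token.

    Assumption: tokens processed inside the prefill widen that pass
    but do not add sequential steps.
    """
    prefill = 1
    if mode == "explicit_cot":
        # Reasoning tokens are generated one by one before the answer.
        return prefill + budget.reasoning_tokens + budget.answer_tokens
    if mode == "answer_only":
        return prefill + budget.answer_tokens
    if mode == "latent_prefill":
        # Latent tokens are placed in the prefill pass, so only the
        # answer tokens are generated autoregressively.
        return prefill + budget.answer_tokens
    raise ValueError(f"unknown mode: {mode}")


budget = DecodeBudget(reasoning_tokens=120, latent_tokens=8, answer_tokens=20)
for mode in ("explicit_cot", "answer_only", "latent_prefill"):
    print(mode, sequential_passes(budget, mode))
```

Under this simplified model, the latent tokens increase the width of the prefill pass but not the number of sequential steps, which is why the prefilled latent setting has the same step count as answer-only prediction. Actual wall-clock latency also depends on prompt length, hardware, and batching, which the sketch ignores.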
