OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
April 20, 2026
作者: Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li, Guang Li, Lingdong Kong, Yingyan Li, Han Wang, Shaoqing Xu, Yuechen Luo, Fang Li, Chenxu Dang, Junli Wang, Tao Xu, Jing Wu, Jianhua Wu, Xiaoshuai Hao, Wen Zhang, Tianyi Jiang, Lingfeng Zhang, Lei Zhou, Yingbo Tang, Jie Wang, Yinfeng Gao, Xizhou Bu, Haochen Tian, Yihang Qiu, Feiyang Jia, Lin Liu, Yigu Ge, Hanbing Li, Yuannan Shen, Jianwei Cui, Hongwei Xie, Bing Wang, Haiyang Sun, Jingwei Zhao, Jiahui Huang, Pei Liu, Zeyu Zhu, Yuncheng Jiang, Zibin Guo, Chuhong Gong, Hanchao Leng, Kun Ma, Naiyang Wang, Guang Chen, Kuiyuan Yang, Hangjun Ye, Long Chen
cs.AI
Abstract
Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but they consistently fall short of their explicit counterparts. We attribute this to purely linguistic latent representations, which compress a symbolic abstraction of the world rather than the causal dynamics that actually govern driving. We therefore present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs the text CoT, we introduce a visual world-model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency and providing direct evidence that tighter compression, when guided by both language and world-model supervision, yields more generalizable representations than verbose token-by-token reasoning. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL
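Since the auxiliary decoders exist only to supervise the latent tokens during training, the core architecture can be summarized compactly. Below is a minimal, self-contained PyTorch sketch of the scheme as we read it from the abstract; every module name (OneVLSketch, lang_head, world_head, traj_head), all dimensions, and the one-waypoint-per-latent trajectory head are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class OneVLSketch(nn.Module):
    """Hypothetical sketch of latent CoT with dual auxiliary decoders.
    Shapes and module names are assumptions for illustration only."""

    def __init__(self, d_model=1024, n_latent=8, vocab_size=32000, n_codes=256):
        super().__init__()
        # Learned queries that become the compact latent reasoning tokens.
        self.latent_queries = nn.Parameter(torch.randn(n_latent, d_model))
        # Stand-in backbone; the paper's backbone is a VLA transformer.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        # Auxiliary decoder 1: reconstructs the text CoT from the latents.
        self.lang_head = nn.Linear(d_model, vocab_size)
        # Auxiliary decoder 2: predicts future-frame visual tokens
        # (e.g. logits over a VQ codebook) from the latents.
        self.world_head = nn.Linear(d_model, n_codes)
        # Planning head: regresses one (x, y) waypoint per latent token.
        self.traj_head = nn.Linear(d_model, 2)

    def forward(self, obs_tokens):
        # Prefill: observation tokens and all latent queries go through the
        # backbone in one parallel pass -- no autoregressive CoT decoding.
        b = obs_tokens.size(0)
        queries = self.latent_queries.unsqueeze(0).expand(b, -1, -1)
        h = self.backbone(torch.cat([obs_tokens, queries], dim=1))
        latents = h[:, -queries.size(1):]            # (b, n_latent, d_model)
        return {
            "traj": self.traj_head(latents),         # kept at inference
            "cot_logits": self.lang_head(latents),   # training-only supervision
            "frame_logits": self.world_head(latents) # training-only supervision
        }

model = OneVLSketch()
obs = torch.randn(2, 64, 1024)   # e.g. 64 fused vision/text tokens per sample
out = model(obs)
print(out["traj"].shape)         # torch.Size([2, 8, 2])
```

Because the latents here are learned queries rather than autoregressively generated tokens, the cot_logits and frame_logits heads can simply be dropped after training, leaving a single parallel prefill whose cost matches answer-only prediction, which is the latency property the abstract claims.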