OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Abstract

Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but they consistently fall short of their explicit counterparts. We suggest that this gap arises because purely linguistic latent representations compress a symbolic abstraction of the world rather than the causal dynamics that actually govern driving. We therefore present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs the text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency. These results provide direct evidence that tighter compression, when guided by both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL
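
As a rough illustration of the latency argument, the sketch below counts sequential forward passes under a simple decoding model: the prefill pass over the prompt emits the first generated token, and each further token costs one more pass. The token counts and the function name `sequential_passes` are hypothetical values chosen for illustration, not figures or code from the paper.

```python
def sequential_passes(generated_tokens: int) -> int:
    """Sequential forward passes needed to emit `generated_tokens` tokens.

    Assumption: the prefill pass over the prompt yields the first token,
    and every later token needs one additional decode pass.
    """
    if generated_tokens < 1:
        raise ValueError("at least one token must be generated")
    return generated_tokens


# Hypothetical sizes, for illustration only.
cot_tokens = 120      # explicit reasoning text, decoded token by token
answer_tokens = 20    # trajectory answer, decoded token by token
latent_tokens = 8     # latent reasoning slots, placed in the prompt

# Explicit CoT: reasoning and answer are both generated autoregressively.
explicit_cot = sequential_passes(cot_tokens + answer_tokens)

# Latent CoT with prefill: the latent tokens are processed inside the single
# parallel prefill pass, so only the answer tokens are generated.
latent_cot = sequential_passes(answer_tokens)

# Answer-only baseline: no reasoning tokens at all.
answer_only = sequential_passes(answer_tokens)

print(explicit_cot, latent_cot, answer_only)
```

Under this model the latent tokens widen the prefill pass but add no sequential steps, which is the sense in which prefilled latent reasoning can match answer-only speed.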
