OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Abstract

Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but they consistently fall short of their explicit counterparts. We suggest that this is because purely linguistic latent representations compress a symbolic abstraction of the world rather than the causal dynamics that actually govern driving. We therefore present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and world model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs the text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency. This provides direct evidence that tighter compression, when guided by both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning.

Project page: https://xiaomi-embodied-intelligence.github.io/OneVL
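
The latency argument above can be illustrated with a toy count of sequential forward passes. This is a minimal sketch, not the OneVL implementation: the function `sequential_passes` and every token count below are hypothetical, and the model assumes that a prefill of any length costs one parallel pass while each decoded token costs one further pass.

```python
# Toy latency model: counts sequential forward passes only.
# All names and token counts are hypothetical illustrations,
# not values taken from the paper.


def sequential_passes(prefill_tokens: int, decoded_tokens: int) -> int:
    """Return the number of sequential forward passes.

    Assumption: all prefill tokens are processed in one parallel pass,
    and each autoregressively decoded token needs one additional pass.
    """
    if prefill_tokens < 0 or decoded_tokens < 0:
        raise ValueError("token counts must be non-negative")
    prefill_pass = 1 if prefill_tokens > 0 else 0
    return prefill_pass + decoded_tokens


PROMPT_TOKENS = 512      # hypothetical image + instruction tokens
COT_TOKENS = 120         # hypothetical length of an explicit text CoT
LATENT_TOKENS = 8        # hypothetical number of compact latent tokens
TRAJECTORY_TOKENS = 20   # hypothetical length of the trajectory answer

# Answer-only: prefill the prompt, then decode the trajectory.
answer_only = sequential_passes(PROMPT_TOKENS, TRAJECTORY_TOKENS)

# Explicit CoT: the reasoning text is decoded token by token first.
explicit_cot = sequential_passes(PROMPT_TOKENS, COT_TOKENS + TRAJECTORY_TOKENS)

# Latent, one step: latent tokens join the prompt in the same prefill pass.
latent_one_step = sequential_passes(
    PROMPT_TOKENS + LATENT_TOKENS, TRAJECTORY_TOKENS
)

print(f"answer-only:     {answer_only} passes")
print(f"explicit CoT:    {explicit_cot} passes")
print(f"latent one-step: {latent_one_step} passes")
```

Under these assumptions, prefilling the latent tokens adds no sequential passes beyond answer-only prediction, whereas explicit CoT adds one pass per reasoning token. Real wall-clock latency also depends on prefill length, hardware, and batching, which this count ignores.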
