

Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation

February 5, 2026
作者: Hai Zhang, Siqi Liang, Li Chen, Yuxian Li, Yukuan Xu, Yichao Zhong, Fu Zhang, Hongyang Li
cs.AI

Abstract

Why must vision-language navigation be bound to detailed and verbose language instructions? While such details ease decision-making, they fundamentally contradict the goal of navigation in the real world. Ideally, agents should be able to navigate unknown environments guided solely by simple, high-level intents. Realizing this ambition introduces a formidable challenge: Beyond-the-View Navigation (BVN), where agents must locate distant, unseen targets without dense, step-by-step guidance. Existing large language model (LLM)-based methods, though adept at following dense instructions, often exhibit short-sighted behavior due to their reliance on short-horizon supervision. Simply extending the supervision horizon, however, destabilizes LLM training. In this work, we identify that video generation models inherently benefit from long-horizon supervision aligned with language instructions, rendering them uniquely suitable for BVN tasks. Capitalizing on this insight, we introduce video generation models to this field for the first time. Yet the prohibitive latency of generating videos spanning tens of seconds makes real-world deployment impractical. To bridge this gap, we propose SparseVideoNav, which achieves sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon, yielding a remarkable 27x speed-up over the unoptimized counterpart. Extensive real-world zero-shot experiments demonstrate that SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines on BVN tasks and marks the first realization of such capability in challenging night scenes.
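The core idea behind the speed-up — generating only a sparse set of future keyframes over a long horizon and densifying the trajectory cheaply in between — can be illustrated with a toy sketch. Everything here (the keyframe count, even spacing, and linear waypoint interpolation) is our own illustrative assumption, not the paper's actual pipeline:

```python
import numpy as np

def sparse_timestamps(horizon_s: float = 20.0, n_keyframes: int = 8) -> np.ndarray:
    """Pick a few evenly spaced timestamps across the planning horizon.

    Generating, say, 8 keyframes instead of a dense 20 s video at 24 fps
    (480 frames) is the kind of sparsity that could explain a large
    inference speed-up (illustrative numbers, not the paper's).
    """
    return np.linspace(0.0, horizon_s, n_keyframes)

def interpolate_waypoints(key_times: np.ndarray,
                          key_xy: np.ndarray,
                          query_times: np.ndarray) -> np.ndarray:
    """Densify a trajectory by linearly interpolating 2D waypoints
    between the sparse keyframe positions."""
    xs = np.interp(query_times, key_times, key_xy[:, 0])
    ys = np.interp(query_times, key_times, key_xy[:, 1])
    return np.stack([xs, ys], axis=1)

# Example: two sparse keyframes 20 s apart, queried at the midpoint.
key_times = np.array([0.0, 20.0])
key_xy = np.array([[0.0, 0.0], [4.0, 4.0]])
midpoint = interpolate_waypoints(key_times, key_xy, np.array([10.0]))
```

The design point is that the expensive generative model only has to produce the sparse anchors; filling in intermediate waypoints is a trivial interpolation that runs in microseconds.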