

Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation

February 5, 2026
Authors: Hai Zhang, Siqi Liang, Li Chen, Yuxian Li, Yukuan Xu, Yichao Zhong, Fu Zhang, Hongyang Li
cs.AI

Abstract

Why must vision-language navigation be bound to detailed and verbose language instructions? While such details ease decision-making, they fundamentally contradict the goal of navigation in the real world. Ideally, agents should possess the autonomy to navigate in unknown environments guided solely by simple, high-level intents. Realizing this ambition introduces a formidable challenge: Beyond-the-View Navigation (BVN), where agents must locate distant, unseen targets without dense, step-by-step guidance. Existing large language model (LLM)-based methods, though adept at following dense instructions, often suffer from short-sighted behaviors due to their reliance on short-horizon supervision. Simply extending the supervision horizon, however, destabilizes LLM training. In this work, we identify that video generation models inherently benefit from long-horizon supervision to align with language instructions, rendering them uniquely suitable for BVN tasks. Capitalizing on this insight, we introduce video generation models into this field for the first time. Yet the prohibitive latency of generating videos spanning tens of seconds makes real-world deployment impractical. To bridge this gap, we propose SparseVideoNav, which achieves sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon. This yields a remarkable 27x speed-up over the unoptimized counterpart. Extensive real-world zero-shot experiments demonstrate that SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines on BVN tasks and marks the first realization of such capability in challenging night scenes.
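The core latency argument above can be illustrated with a back-of-the-envelope sketch: rather than generating a dense video over the 20-second horizon, generate only a handful of sparse keyframes and infer the trajectory between them. All names and numbers below (frame rate, keyframe count) are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch of the sparse-future idea: fewer generated frames
# over the same 20-second horizon means proportionally lower generation cost.
# HORIZON_S comes from the abstract; DENSE_FPS and NUM_SPARSE are assumed.

HORIZON_S = 20.0   # planning horizon stated in the abstract
DENSE_FPS = 10     # assumed dense video-generation frame rate
NUM_SPARSE = 8     # assumed number of sparse keyframes

def sparse_timestamps(horizon_s: float, n: int) -> list[float]:
    """Evenly spaced keyframe times across the horizon (excluding t=0)."""
    return [horizon_s * (i + 1) / n for i in range(n)]

dense_frames = int(HORIZON_S * DENSE_FPS)      # 200 frames if generated densely
keyframes = sparse_timestamps(HORIZON_S, NUM_SPARSE)
reduction = dense_frames / NUM_SPARSE          # frames saved per planning step

print(keyframes)   # [2.5, 5.0, 7.5, 10.0, 12.5, 15.0, 17.5, 20.0]
print(reduction)   # 25.0
```

Under these assumed numbers, generating 8 keyframes instead of 200 dense frames cuts the per-step generation workload by 25x, on the same order as the 27x speed-up the abstract reports for the full system.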