
HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models

December 10, 2025
Authors: Minghui Lin, Pengxiang Ding, Shu Wang, Zifeng Zhuang, Yang Liu, Xinyang Tong, Wenxuan Song, Shangke Lyu, Siteng Huang, Donglin Wang
cs.AI

Abstract

Vision-Language-Action (VLA) models have recently enabled robotic manipulation by grounding visual and linguistic cues into actions. However, most VLAs assume the Markov property, relying only on the current observation and thus suffering from temporal myopia that degrades long-horizon coherence. In this work, we view motion as a more compact and informative representation of temporal context and world dynamics, capturing inter-state changes while filtering static pixel-level noise. Building on this idea, we propose HiF-VLA (Hindsight, Insight, and Foresight for VLAs), a unified framework that leverages motion for bidirectional temporal reasoning. HiF-VLA encodes past dynamics through hindsight priors, anticipates future motion via foresight reasoning, and integrates both through a hindsight-modulated joint expert to enable a "think-while-acting" paradigm for long-horizon manipulation. As a result, HiF-VLA surpasses strong baselines on LIBERO-Long and CALVIN ABC-D benchmarks, while incurring negligible additional inference latency. Furthermore, HiF-VLA achieves substantial improvements in real-world long-horizon manipulation tasks, demonstrating its broad effectiveness in practical robotic settings.
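
To make the abstract's three-part design concrete, the following is a minimal structural sketch of the described pipeline: a hindsight prior summarizing past motion, a foresight head predicting future motion, and a hindsight-modulated joint expert producing actions. All module names, tensor shapes, and design details (GRU encoder, FiLM-style modulation, MLP heads) are assumptions chosen for illustration and are not specified by the paper; this is not the authors' implementation.

```python
# Hypothetical sketch of the HiF-VLA idea (assumed architecture, not the paper's code).
import torch
import torch.nn as nn


class HiFVLASketch(nn.Module):
    """Hindsight prior over past motion, foresight motion prediction, and a
    hindsight-modulated joint expert (all components are illustrative guesses)."""

    def __init__(self, obs_dim=512, motion_dim=64, hidden_dim=256, action_dim=7):
        super().__init__()
        # Hindsight: summarize past motion (inter-state changes) into a prior vector.
        self.hindsight_enc = nn.GRU(motion_dim, hidden_dim, batch_first=True)
        # Foresight: predict future motion from current observation features + prior.
        self.foresight_head = nn.Sequential(
            nn.Linear(obs_dim + hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, motion_dim),
        )
        # Joint expert: fuse observation and predicted motion into an action,
        # modulated FiLM-style by the hindsight prior ("think while acting").
        self.film = nn.Linear(hidden_dim, 2 * hidden_dim)
        self.expert_in = nn.Linear(obs_dim + motion_dim, hidden_dim)
        self.action_out = nn.Linear(hidden_dim, action_dim)

    def forward(self, obs_feat, past_motion):
        # obs_feat: (B, obs_dim) current features; past_motion: (B, T, motion_dim)
        _, h = self.hindsight_enc(past_motion)           # h: (1, B, hidden_dim)
        hindsight = h[-1]                                # (B, hidden_dim)
        future_motion = self.foresight_head(
            torch.cat([obs_feat, hindsight], dim=-1))    # foresight reasoning
        gamma, beta = self.film(hindsight).chunk(2, dim=-1)
        fused = torch.relu(self.expert_in(
            torch.cat([obs_feat, future_motion], dim=-1)))
        action = self.action_out(gamma * fused + beta)   # hindsight-modulated expert
        return action, future_motion


if __name__ == "__main__":
    model = HiFVLASketch()
    act, pred_motion = model(torch.randn(2, 512), torch.randn(2, 8, 64))
    print(act.shape, pred_motion.shape)  # torch.Size([2, 7]) torch.Size([2, 64])
```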