HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models
December 10, 2025
Authors: Minghui Lin, Pengxiang Ding, Shu Wang, Zifeng Zhuang, Yang Liu, Xinyang Tong, Wenxuan Song, Shangke Lyu, Siteng Huang, Donglin Wang
cs.AI
Abstract
Vision-Language-Action (VLA) models have recently enabled robotic manipulation by grounding visual and linguistic cues into actions. However, most VLAs assume the Markov property, relying only on the current observation and thus suffering from temporal myopia that degrades long-horizon coherence. In this work, we view motion as a more compact and informative representation of temporal context and world dynamics, capturing inter-state changes while filtering static pixel-level noise. Building on this idea, we propose HiF-VLA (Hindsight, Insight, and Foresight for VLAs), a unified framework that leverages motion for bidirectional temporal reasoning. HiF-VLA encodes past dynamics through hindsight priors, anticipates future motion via foresight reasoning, and integrates both through a hindsight-modulated joint expert to enable a "think-while-acting" paradigm for long-horizon manipulation. As a result, HiF-VLA surpasses strong baselines on the LIBERO-Long and CALVIN ABC-D benchmarks, while incurring negligible additional inference latency. Furthermore, HiF-VLA achieves substantial improvements in real-world long-horizon manipulation tasks, demonstrating its broad effectiveness in practical robotic settings.
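The abstract describes a three-part design: a hindsight pathway encoding past motion, a foresight pathway anticipating future motion, and a hindsight-modulated joint expert that fuses both when predicting actions. Below is a minimal, hypothetical PyTorch sketch of that bidirectional motion-conditioning idea; every module name, tensor shape, and the FiLM-style modulation are illustrative assumptions made for this sketch, not the authors' released implementation.

```python
# Hypothetical sketch of hindsight/foresight motion conditioning for a VLA policy.
# All module names, shapes, and the modulation scheme are illustrative assumptions.
import torch
import torch.nn as nn


class HindsightMotionEncoder(nn.Module):
    """Encodes past motion (frame-to-frame feature differences) into a hindsight prior."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, past_feats: torch.Tensor) -> torch.Tensor:
        # past_feats: (B, T, D) visual features of the last T observations.
        motion = past_feats[:, 1:] - past_feats[:, :-1]   # inter-state changes, (B, T-1, D)
        return self.proj(motion.mean(dim=1))              # hindsight prior, (B, D)


class ForesightMotionHead(nn.Module):
    """Predicts a future-motion token from current observation and language features."""

    def __init__(self, dim: int):
        super().__init__()
        self.pred = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, obs_feat: torch.Tensor, lang_feat: torch.Tensor) -> torch.Tensor:
        return self.pred(torch.cat([obs_feat, lang_feat], dim=-1))  # (B, D)


class JointActionExpert(nn.Module):
    """Action head whose features are modulated (FiLM-style) by the hindsight prior."""

    def __init__(self, dim: int, action_dim: int):
        super().__init__()
        self.film = nn.Linear(dim, 2 * dim)  # produces per-feature scale and shift
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, action_dim))

    def forward(self, obs_feat, foresight, hindsight):
        scale, shift = self.film(hindsight).chunk(2, dim=-1)
        modulated = obs_feat * (1 + scale) + shift          # hindsight-modulated features
        fused = torch.cat([modulated, foresight], dim=-1)   # join with predicted future motion
        return self.head(fused)                             # action, (B, action_dim)


if __name__ == "__main__":
    B, T, D, A = 2, 4, 256, 7
    hindsight = HindsightMotionEncoder(D)
    foresight = ForesightMotionHead(D)
    expert = JointActionExpert(D, A)

    past = torch.randn(B, T, D)   # features of recent observations
    obs = past[:, -1]             # current observation feature
    lang = torch.randn(B, D)      # instruction embedding

    action = expert(obs, foresight(obs, lang), hindsight(past))
    print(action.shape)           # torch.Size([2, 7])
```

Differencing consecutive observation features stands in for a motion representation here; the paper's actual motion encoding, foresight reasoning, and expert architecture may differ substantially.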