VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation

November 21, 2025
Authors: Hanyu Zhou, Chuanhao Ma, Gim Hee Lee
cs.AI

Abstract

Vision-language-action (VLA) models show potential for general robotic tasks, but they still struggle with spatiotemporally coherent manipulation, which requires fine-grained representations. Existing methods typically embed 3D positions into visual representations to enhance the spatial precision of actions, but they struggle to achieve temporally coherent control over action execution. In this work, we propose VLA-4D, a general VLA model with 4D awareness for spatiotemporally coherent robotic manipulation. Our model is guided by two key designs: 1) 4D-aware visual representation: we extract visual features, embed 1D time into 3D positions to form 4D embeddings, and fuse them into a unified visual representation via a cross-attention mechanism. 2) Spatiotemporal action representation: we extend conventional spatial action representations with temporal information to enable spatiotemporal planning, and align the multimodal representations into the LLM for spatiotemporal action prediction. Within this unified framework, the designed visual and action representations jointly make robotic manipulation spatially smooth and temporally coherent. In addition, we extend the VLA dataset with temporal action annotations for fine-tuning our model. Extensive experiments verify the superiority of our method across different robotic manipulation tasks.
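
To make the first design more concrete, the following is a minimal sketch of how 4D embeddings might be built and fused with visual features. It is not the authors' implementation: the abstract does not specify the encoding or the attention layout, so the sinusoidal encoding, the choice of visual tokens as queries, the residual connection, the absence of learned projections, and the names `fourier_embed_4d` and `cross_attention_fuse` are all assumptions made for illustration.

```python
import math
import random


def fourier_embed_4d(xyzt, num_freqs=2):
    """Encode one (x, y, z, t) tuple with sin/cos features.

    Assumption: a sinusoidal encoding at doubling frequencies; the
    paper's actual 4D embedding may differ.
    """
    feats = []
    for value in xyzt:
        for k in range(num_freqs):
            feats.append(math.sin((2 ** k) * value))
            feats.append(math.cos((2 ** k) * value))
    return feats  # length = 4 coordinates * 2 functions * num_freqs


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(visual, embed_4d):
    """Single-head cross-attention without learned projections.

    Assumption: visual tokens act as queries and the 4D embeddings as
    keys and values; the attended result is added back to each visual
    token. Both inputs must share the same feature dimension.
    """
    dim = len(visual[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in visual:
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in embed_4d]
        weights = softmax(scores)
        attended = [sum(w * value[j] for w, value in zip(weights, embed_4d))
                    for j in range(dim)]
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


# Hypothetical toy input: three 3D points observed at times 0.0, 0.5, 1.0.
random.seed(0)
points = [(0.1, 0.2, 0.3, 0.0), (0.4, 0.1, 0.5, 0.5), (0.2, 0.6, 0.1, 1.0)]
embed_4d = [fourier_embed_4d(p) for p in points]
visual = [[random.uniform(-1.0, 1.0) for _ in range(16)] for _ in points]
fused = cross_attention_fuse(visual, embed_4d)
```

In this sketch each fused token keeps the 16-dimensional shape of its visual feature while mixing in position and time information from every point. A real model would most likely use learned query, key, and value projections and multiple attention heads.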