VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

March 24, 2026
Authors: Haoran Yuan, Weigang Yi, Zhenyu Zhang, Wendi Chen, Yuchen Mo, Jiashi Yin, Xinzhuo Li, Xiangyu Zeng, Chuan Wen, Cewu Lu, Katherine Driggs-Campbell, Ismini Lourentzou
cs.AI

Abstract

Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models perform strongly on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile-Action Model (VTAM), a multimodal world-modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via lightweight modality-transfer finetuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross-modal attention, preventing visual latents from dominating the action model. VTAM demonstrates superior performance in contact-rich manipulation, maintaining a robust average success rate of 90%. In challenging scenarios that demand high-fidelity force awareness, such as potato-chip pick-and-place, VTAM outperforms the π0.5 baseline by 80%. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world-action models, offering a scalable path toward physically grounded embodied foundation models.
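
The abstract names two mechanisms, tactile-token fusion into a pretrained video transformer and a regularizer that balances cross-modal attention, but gives no implementation detail. The PyTorch sketch below illustrates one plausible reading; the names TactileAdapter, tactile_regularization_loss, and lambda_tac, the token ordering, and the loss form (squared difference of attention mass per modality) are assumptions for illustration, not the paper's actual method.

import torch
import torch.nn as nn

class TactileAdapter(nn.Module):
    # Hypothetical lightweight adapter: projects raw tactile readings
    # into the token space of a frozen pretrained video transformer,
    # so tactile tokens can be concatenated with visual tokens.
    def __init__(self, tactile_dim: int, d_model: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(tactile_dim, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, tactile: torch.Tensor) -> torch.Tensor:
        # tactile: (batch, num_tactile_tokens, tactile_dim)
        return self.proj(tactile)  # (batch, num_tactile_tokens, d_model)

def tactile_regularization_loss(attn: torch.Tensor,
                                num_visual: int) -> torch.Tensor:
    # attn: cross-attention weights from the action model, shaped
    # (batch, heads, num_queries, num_visual + num_tactile), with
    # visual tokens assumed to come first (ordering is an assumption).
    visual_mass = attn[..., :num_visual].sum(dim=-1)
    tactile_mass = attn[..., num_visual:].sum(dim=-1)
    # Penalize imbalance in attention mass between the two modalities,
    # discouraging the visual latents from dominating action queries.
    return ((visual_mass - tactile_mass) ** 2).mean()

# Usage sketch: lambda_tac is a hypothetical weighting hyperparameter.
# total_loss = action_loss + lambda_tac * tactile_regularization_loss(attn, Nv)

Under this reading, only the adapter (and the action head) would be updated during modality-transfer finetuning while the video backbone stays frozen, which would be consistent with the abstract's claim that no tactile-language pairs or independent tactile pretraining are required.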