VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

Abstract

Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models perform strongly on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via lightweight modality-transfer finetuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross-modal attention, preventing visual latent dominance in the action model. VTAM demonstrates superior performance in contact-rich manipulation, maintaining a robust success rate of 90 percent on average. In challenging scenarios such as potato chip pick-and-place, which requires high-fidelity force awareness, VTAM outperforms the π0.5 baseline by 80 percent. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, providing a scalable approach to physically grounded embodied foundation models.
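
The abstract does not give the form of the tactile regularization loss, so the following is only an illustrative sketch of one way a "balanced cross-modal attention" penalty could look. It assumes attention weights from action queries over a mixed sequence of visual and tactile tokens, and penalizes queries whose share of attention on tactile tokens falls below a target. The function names, the `target_share` parameter, and the squared-shortfall form are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def tactile_attention_mass(attn_row: Sequence[float],
                           is_tactile: Sequence[bool]) -> float:
    """Fraction of one query's attention that lands on tactile tokens."""
    total = sum(attn_row)
    if total <= 0.0:
        raise ValueError("attention row must have positive total weight")
    tactile = sum(w for w, t in zip(attn_row, is_tactile) if t)
    return tactile / total


def tactile_balance_penalty(attn: Sequence[Sequence[float]],
                            is_tactile: Sequence[bool],
                            target_share: float = 0.3) -> float:
    """Hypothetical regularizer: mean squared shortfall of tactile attention.

    A query contributes zero when at least `target_share` of its attention
    goes to tactile tokens, and a growing penalty as visual tokens dominate.
    """
    shortfalls = [
        max(0.0, target_share - tactile_attention_mass(row, is_tactile))
        for row in attn
    ]
    return sum(s * s for s in shortfalls) / len(shortfalls)


# Two action queries over four tokens: two visual, then two tactile.
attn = [
    [0.70, 0.20, 0.05, 0.05],  # vision-dominated query
    [0.30, 0.30, 0.20, 0.20],  # more balanced query
]
is_tactile = [False, False, True, True]
penalty = tactile_balance_penalty(attn, is_tactile, target_share=0.3)
```

In a training setup, a term like this would be computed on differentiable tensors and added to the action loss with a small weight. The plain-Python version here only shows the arithmetic: the first query is penalized for under-attending to touch, the second is not.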
