VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

Abstract

Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence: they learn implicit world dynamics from raw video streams and produce temporally consistent action predictions. Although such models perform strongly on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, which leads to unstable or imprecise behavior. To bridge this gap, we introduce the Video-Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via lightweight modality-transfer finetuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross-modal attention and prevents visual latents from dominating the action model. VTAM demonstrates superior performance in contact-rich manipulation, maintaining a robust success rate of 90 percent on average. In challenging scenarios such as potato chip pick-and-place, which requires high-fidelity force awareness, VTAM outperforms the π0.5 baseline by 80 percent. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, and they provide a scalable approach to physically grounded embodied foundation models.
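
The abstract does not state the form of the tactile regularization loss. As a purely illustrative sketch, and not the authors' formulation, one way to discourage visual dominance is to penalize action queries whose attention mass on tactile tokens falls below a chosen floor. The function name `tactile_attention_penalty`, the token layout (visual tokens first, tactile tokens last), and the `min_tactile_mass` threshold are all assumptions made for this example.

```python
def tactile_attention_penalty(attn_rows, num_visual, min_tactile_mass=0.3):
    """Hypothetical regularizer (an assumption, not the paper's loss).

    attn_rows: list of attention distributions, one per action query, each
               summing to 1 over [visual tokens..., tactile tokens...].
    num_visual: number of leading entries that belong to visual tokens.
    min_tactile_mass: assumed minimum share of attention for tactile tokens.
    Returns the mean squared shortfall of tactile attention mass.
    """
    penalties = []
    for row in attn_rows:
        tactile_mass = sum(row[num_visual:])  # attention spent on tactile tokens
        shortfall = max(0.0, min_tactile_mass - tactile_mass)
        penalties.append(shortfall ** 2)  # no penalty once the floor is met
    return sum(penalties) / len(penalties)


# Two action queries attending over 3 visual + 2 tactile tokens.
attn = [
    [0.5, 0.3, 0.15, 0.03, 0.02],  # vision-dominated: tactile mass 0.05
    [0.2, 0.2, 0.2, 0.2, 0.2],     # balanced: tactile mass 0.40
]
penalty = tactile_attention_penalty(attn, num_visual=3)
```

In this toy input only the first query is penalized, so the result is the squared shortfall of that row averaged over both queries. In a training setup, a term like this would typically be added to the action loss with a small weight; that weighting is likewise an assumption here.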
