VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

Abstract

Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models perform strongly on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via lightweight modality-transfer finetuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross-modal attention, preventing visual latents from dominating the action model. VTAM demonstrates superior performance in contact-rich manipulation, maintaining a robust average success rate of 90%. In challenging scenarios such as potato chip pick-and-place, which requires high-fidelity force awareness, VTAM outperforms the π0.5 baseline by 80%. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, providing a scalable approach to physically grounded embodied foundation models.
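The abstract does not give the form of the tactile regularization loss. As a purely illustrative sketch of what "balanced cross-modal attention" could mean, the snippet below measures how much of one query's attention mass falls on tactile tokens and penalizes the gap from a target share. The token layout, the `target_share` parameter, the squared-error form, and all function names are assumptions made for illustration, not the authors' formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tactile_attention_share(scores, num_visual):
    """Fraction of attention mass that lands on tactile tokens.

    Assumes the first `num_visual` scores belong to visual tokens and the
    remainder to tactile tokens (a hypothetical layout, not from the paper).
    """
    weights = softmax(scores)
    return sum(weights[num_visual:])


def balance_penalty(scores, num_visual, target_share=0.5):
    """Squared gap between the tactile attention share and a target share.

    `target_share` is an assumed hyperparameter chosen for this sketch.
    """
    share = tactile_attention_share(scores, num_visual)
    return (share - target_share) ** 2


# One action query attending to 3 visual tokens followed by 3 tactile tokens.
balanced = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]      # equal scores for both modalities
vision_heavy = [4.0, 4.0, 4.0, 0.0, 0.0, 0.0]  # visual tokens dominate

print(balance_penalty(balanced, num_visual=3))
print(balance_penalty(vision_heavy, num_visual=3))
```

Under these assumptions the penalty is near zero when both modalities receive equal attention and grows as visual tokens absorb the attention mass, which is the qualitative behavior the abstract attributes to the regularizer.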
