DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Abstract

Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial, and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces dynamic-region-guided world knowledge prediction, integrated with spatial and semantic cues, which provides compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial, and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments in both real-world and simulation environments demonstrate that DreamVLA achieves a 76.7% success rate on real robot tasks and an average length of 4.44 on the CALVIN ABC-D benchmark.
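The block-wise structured attention described in the abstract can be illustrated with a small mask-building sketch. This is not the authors' implementation: the group names, the token counts, and the rule that every token may attend to a shared "context" group plus its own group are assumptions made here for illustration only. The sketch shows how attention between the dynamic, spatial, and semantic groups could be masked out.

```python
from typing import Dict, List


def build_block_mask(group_sizes: Dict[str, int], shared: str = "context") -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Assumption (not from the paper): every token may attend to the shared
    group and to tokens of its own group, and to nothing else.
    """
    # Label each token position with the name of the group it belongs to.
    labels = [name for name, size in group_sizes.items() for _ in range(size)]
    mask = []
    for query_group in labels:
        # Allow attention to the shared group and to the query's own group only.
        row = [key_group == shared or key_group == query_group for key_group in labels]
        mask.append(row)
    return mask


# Hypothetical token counts, chosen only for illustration.
sizes = {"context": 3, "dynamic": 2, "spatial": 2, "semantic": 2}
mask = build_block_mask(sizes)

# Print one row per query token: 1 = attention allowed, 0 = masked.
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

In the printed grid, the rows for the dynamic, spatial, and semantic tokens carry ones only in the context columns and in their own block, so no information flows between the three knowledge groups under these assumptions.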

