Chain of World: World Model Thinking in Latent Motion

Abstract

Vision-Language-Action (VLA) models are a promising path toward embodied intelligence, yet they often overlook the predictive and temporal-causal structure underlying visual dynamics. World-model VLAs address this by predicting future frames, but they waste capacity reconstructing redundant backgrounds. Latent-action VLAs encode frame-to-frame transitions compactly, but they lack temporally continuous dynamics modeling and world knowledge. To overcome these limitations, we introduce CoWVLA (Chain-of-World VLA), a new "Chain of World" paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. First, a pre-trained video VAE serves as a latent motion extractor, explicitly factorizing video segments into structure and motion latents. Then, during pre-training, the VLA learns to infer a continuous latent motion chain from an instruction and an initial frame, and to predict the segment's terminal frame. Finally, during co-fine-tuning, these latent dynamics are aligned with discrete action prediction by jointly modeling sparse keyframes and action sequences in a unified autoregressive decoder. This design preserves the world-model benefits of temporal reasoning and world knowledge while retaining the compactness and interpretability of latent actions, enabling efficient visuomotor learning. Extensive experiments on robotic simulation benchmarks show that CoWVLA outperforms existing world-model and latent-action approaches while achieving moderate computational efficiency, highlighting its potential as a more effective VLA pre-training paradigm. The project website can be found at https://fx-hit.github.io/cowvla-io.
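The two training stages described above can be pictured as two token-sequence layouts fed to the same autoregressive decoder. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `Segment`, `pretraining_layout`, `cofinetuning_layout`, and `target_mask` are hypothetical, the token counts are arbitrary, and the order in which keyframes and action chunks are interleaved during co-fine-tuning is an assumption, since the abstract does not specify it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    kind: str        # "instruction", "frame", "motion", or "action"
    length: int      # number of tokens (illustrative sizes only)
    is_target: bool  # True if the decoder is trained to predict this span


def pretraining_layout(n_text: int, n_frame: int, n_motion: int) -> List[Segment]:
    """Instruction and initial frame are context; the latent motion chain
    and the terminal frame are prediction targets."""
    return [
        Segment("instruction", n_text, False),
        Segment("frame", n_frame, False),   # initial frame
        Segment("motion", n_motion, True),  # latent motion chain
        Segment("frame", n_frame, True),    # terminal frame
    ]


def cofinetuning_layout(
    n_text: int, n_frame: int, n_action: int, n_keyframes: int
) -> List[Segment]:
    """Instruction and first keyframe are context; action chunks and later
    sparse keyframes alternate as targets (assumed ordering)."""
    layout = [
        Segment("instruction", n_text, False),
        Segment("frame", n_frame, False),
    ]
    for _ in range(n_keyframes - 1):
        layout.append(Segment("action", n_action, True))
        layout.append(Segment("frame", n_frame, True))
    return layout


def target_mask(layout: List[Segment]) -> List[bool]:
    """Expand a layout into a per-token mask marking which positions
    would contribute to the next-token prediction loss."""
    mask: List[bool] = []
    for seg in layout:
        mask.extend([seg.is_target] * seg.length)
    return mask
```

For example, `target_mask(pretraining_layout(4, 6, 3))` yields a 19-position mask in which only the motion and terminal-frame positions are marked as targets. The point of the sketch is only that both stages share one decoder and differ in which spans appear in the sequence and which are supervised.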
