τ₀-WM: A Unified Video-Action World Model for Robotic Manipulation

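Abstract

Robotic manipulation requires models that generate executable actions while predicting and evaluating future outcomes before physical execution. In this paper, we present τ₀-World Model (τ₀-WM), a unified video-action world model that integrates policy learning, video prediction, and action evaluation within a single future-prediction framework. Built on a shared video diffusion backbone, τ₀-WM provides two complementary interfaces. First, a video action model jointly predicts future visual latents and continuous action chunks from multi-view observations, language instructions, and robot state. Second, an action-conditioned video simulator rolls out candidate action chunks into multi-view futures and predicts dense task-progress scores. The model is trained with modality-specific supervision masks on approximately 27,300 hours of data comprising real-robot teleoperation, UMI-style interaction, egocentric human videos, and rollout or failure trajectories. At inference time, τ₀-WM uses test-time computation to sample action candidates, rank them by re-denoising consistency, and invoke simulator-based rectification for low-quality candidates. On challenging long-horizon and fine-grained robotic manipulation tasks, τ₀-WM outperforms other relevant baselines.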

English

Robotic manipulation requires models that generate executable actions while anticipating and evaluating their future consequences before physical execution. We present τ₀-World Model (τ₀-WM), a unified video-action world model that integrates policy learning, video prediction, and action evaluation within a single future-predictive framework. Built on a shared video diffusion backbone, τ₀-WM provides two complementary interfaces. First, a video action model jointly predicts future visual latents and continuous action chunks from multi-view observations, language instructions, and robot state. Second, an action-conditioned video simulator rolls out candidate action chunks into multi-view futures and predicts dense task-progress scores. The model is trained on approximately 27,300 hours of real-robot teleoperation, UMI-style interaction, egocentric human videos, and rollout or failure trajectories using modality-specific supervision masks. At inference time, τ₀-WM uses test-time computation to sample action candidates, rank them with re-denoising consistency, and invoke simulator-based rectification for low-quality candidates. On challenging long-horizon and fine-grained robotic manipulation tasks, τ₀-WM shows superior performance over other relevant baselines.
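
The inference-time procedure described in the abstract (sample candidates, rank by re-denoising consistency, fall back to the simulator for weak candidates) can be illustrated with a minimal Python sketch. It is not the authors' implementation: every function below is a hypothetical toy stand-in, the chunk shape (8 steps, 7 action dimensions), the threshold, and the number of candidates are arbitrary assumptions, and "rectification" is assumed here to mean re-ranking the candidates by the simulator's predicted task progress, which the abstract does not specify.

```python
import numpy as np

def sample_action_chunks(rng, num_candidates, horizon=8, action_dim=7):
    # Hypothetical stand-in for sampling action chunks from the video action model.
    return rng.normal(size=(num_candidates, horizon, action_dim))

def redenoise(rng, chunk, noise_scale=0.5):
    # Hypothetical stand-in: add noise to the chunk, then apply a toy
    # "denoiser" that simply shrinks the noisy chunk toward zero.
    noisy = chunk + noise_scale * rng.normal(size=chunk.shape)
    return 0.5 * noisy

def consistency_score(rng, chunk):
    # Higher is better: negative mean squared change under re-denoising.
    return -float(np.mean((redenoise(rng, chunk) - chunk) ** 2))

def predicted_progress(chunk):
    # Hypothetical stand-in for the simulator's dense task-progress score.
    # The toy rule prefers small-magnitude actions; a real simulator would
    # roll the chunk out into future frames and score those.
    return -float(np.mean(np.abs(chunk)))

def select_action_chunk(rng, num_candidates=16, threshold=-0.5):
    # 1) Sample candidates, 2) rank them by re-denoising consistency.
    candidates = sample_action_chunks(rng, num_candidates)
    scores = np.array([consistency_score(rng, c) for c in candidates])
    best = int(np.argmax(scores))
    # 3) Assumed rectification rule: if even the best candidate is below the
    #    threshold, re-rank all candidates by predicted task progress.
    rectified = bool(scores[best] < threshold)
    if rectified:
        progress = np.array([predicted_progress(c) for c in candidates])
        best = int(np.argmax(progress))
    return candidates[best], best, rectified
```

Calling `select_action_chunk(np.random.default_rng(0))` returns the chosen chunk, its index among the candidates, and a flag indicating whether the simulator fallback was used. Gating the simulator behind a threshold reflects the abstract's statement that rectification is invoked only for low-quality candidates, so the more expensive rollout is skipped when the cheap consistency check already passes.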