PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation

Abstract

World foundation models (WFMs) are powerful simulators, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required for robotic manipulation. Robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, but current multi-view world models simply concatenate view tokens without explicit geometric reasoning. The result is cross-view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient for multi-view 3D consistency. To that end, we present PAIWorld, a framework that augments diffusion-transformer (DiT) world models with three core components: (1) Geometry-Aware Cross-View Attention blocks that establish an explicit pathway across views, (2) Geometric Rotary Position Embedding, which encodes camera ray directions and extrinsic poses into the attention mechanism, and (3) Latent 3D-REPA, which distills 3D-aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT-based world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while enabling downstream applications such as model-based planning, world action models, and multi-view policy post-training.
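
To make the geometric input concrete, the following is a minimal sketch of how a per-pixel camera ray direction can be derived from a standard pinhole camera model. It is not PAIWorld's implementation: the function names `pixel_ray_world` and `ray_grid`, the parameter layout, and the use of a camera-to-world rotation matrix are all assumptions made for illustration. The abstract states only that camera ray directions and extrinsic poses are encoded into attention, not how they are computed or embedded.

```python
import math

def pixel_ray_world(u, v, fx, fy, cx, cy, R_c2w):
    """Unit ray direction in world coordinates for pixel (u, v).

    Hypothetical helper: assumes a pinhole camera with focal lengths
    (fx, fy), principal point (cx, cy), and a 3x3 camera-to-world
    rotation R_c2w given as nested lists.
    """
    # Back-project the pixel through the intrinsics (camera frame, z forward).
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate the direction into the shared world frame.
    d_world = [sum(R_c2w[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    # Normalize so that only the direction is kept.
    norm = math.sqrt(sum(c * c for c in d_world))
    return [c / norm for c in d_world]

def ray_grid(width, height, fx, fy, cx, cy, R_c2w):
    """Rays through the pixel centers of a (height x width) image."""
    return [
        [pixel_ray_world(u + 0.5, v + 0.5, fx, fy, cx, cy, R_c2w)
         for u in range(width)]
        for v in range(height)
    ]

# Two illustrative cameras: one axis-aligned, one rotated 90 degrees about y.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ROT_Y_90 = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]

front = pixel_ray_world(32.0, 32.0, 50.0, 50.0, 32.0, 32.0, IDENTITY)
side = pixel_ray_world(32.0, 32.0, 50.0, 50.0, 32.0, 32.0, ROT_Y_90)
```

Here the same pixel (the principal point) maps to the world direction `[0, 0, 1]` for the first camera and `[1, 0, 0]` for the second. Expressing every view's tokens in one shared frame like this is the kind of signal that plain concatenation of view tokens does not provide.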