PAIWorld: A World Foundation Model for 3D-Consistent Robotic Manipulation

Abstract

World foundation models (WFMs) are powerful simulators, yet most operate in a single-view setting and lack the multi-view 3D consistency required for robotic manipulation. Robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, but current multi-view world models simply concatenate view tokens without explicit geometric reasoning. The result is cross-view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient for multi-view 3D consistency. To this end, we present PAIWorld, a framework that augments diffusion-transformer (DiT) world models with three core components: (1) Geometry-Aware Cross-View Attention blocks, which establish an explicit pathway across views; (2) Geometric Rotary Position Embedding, which encodes camera ray directions and extrinsic poses into the attention mechanism; and (3) Latent 3D-REPA, which distills 3D-aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT-based world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard. It also enables downstream applications such as model-based planning, world action models, and multi-view policy post-training.
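To make two of the ingredients named above concrete, the following is a minimal NumPy sketch, not the PAIWorld implementation. It assumes a pinhole camera with intrinsics `K` and a camera-to-world pose, and shows (a) how per-pixel ray directions, the kind of geometric signal a ray-based position embedding could consume, can be computed, and (b) a representation-alignment loss in the general REPA style, written here as negative cosine similarity with a plain linear projection. The function names, the linear projection, and the toy camera values are all illustrative assumptions; the abstract does not specify the actual formulation.

```python
import numpy as np

def camera_ray_directions(K, cam_to_world, height, width):
    """Unit ray directions in the world frame, one per pixel centre.

    K: (3, 3) pinhole intrinsics; cam_to_world: (4, 4) camera pose.
    Both conventions are assumptions made for this sketch.
    """
    # Pixel centres in homogeneous image coordinates (u, v, 1).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3)
    # Back-project to camera-frame directions, then rotate into the world frame.
    dirs_cam = pixels @ np.linalg.inv(K).T
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T
    return dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)

def alignment_loss(latent_tokens, teacher_tokens, projection):
    """Negative mean cosine similarity between projected model tokens
    and tokens from a frozen teacher (hypothetical linear projection)."""
    projected = latent_tokens @ projection                # (N, D_teacher)
    p = projected / np.linalg.norm(projected, axis=-1, keepdims=True)
    t = teacher_tokens / np.linalg.norm(teacher_tokens, axis=-1, keepdims=True)
    return float(-(p * t).sum(axis=-1).mean())

# Toy usage: a 64x48 camera at the world origin, and random token features.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
rays = camera_ray_directions(K, np.eye(4), height=48, width=64)
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
loss = alignment_loss(tokens, tokens, np.eye(8))  # identical features give the minimum, about -1
```

In this sketch the ray directions depend on both the intrinsics and the rotation part of the pose, so two cameras observing the same scene point receive geometrically related encodings, while the alignment term reaches its minimum when the projected features point in the same direction as the teacher features.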