
PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation

June 16, 2026
Authors: Yuhang Huang, Xuan Lv, Junyan Xu, Zhiyuan Yu, Jiazhao Zhang, Ruizhen Hu, Wancheng Feng, Shilong Zou, Hewen Xiao, Ziqiao Zhou, Kaiyun Huang, Zhiyu Peng, Juzhan Xu, Hang Zhao, Chenyang Zhu, Renjiao Yi, Yifei Huang, Douhui Wu, Yan Zhang, Kexu Cheng, Chunhe Song, Yunzhi Xue, Xiuhong Zhang, Leitao Guo, Yunji Chen, Bin Wu, Haibin Yu, Kai Xu
cs.AI

Abstract

World foundation models (WFMs) are powerful simulators, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required for robotic manipulation. While robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, current multi-view world models simply concatenate view tokens without explicit geometric reasoning. This causes cross-view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient. To address this, we present PAIWorld, a framework that augments diffusion-transformer world models via three core components: (1) Geometry-Aware Cross-View Attention blocks that establish an explicit pathway across views, (2) Geometric Rotary Position Embedding that encodes camera ray directions and extrinsic poses into the attention mechanism, and (3) Latent 3D-REPA, which distills 3D-aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT-based world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while enabling downstream applications such as model-based planning, world action models, and multi-view policy post-training.
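
To make the third component more concrete, here is a minimal sketch of a representation-alignment objective of the kind that "distilling features from a frozen model" usually implies. It is an illustration under assumptions, not the authors' implementation: the abstract does not specify the loss, so the cosine-similarity form, the function names `cosine_similarity` and `alignment_loss`, and the per-token pairing are all hypothetical. The idea is that per-token features from the world model (the student) are pulled toward the matching features of a frozen 3D foundation model (the teacher).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_loss(student_tokens, teacher_tokens):
    """Hypothetical alignment loss: 1 minus the mean per-token cosine similarity.

    student_tokens: per-token features from the world model (assumed already
                    projected to the teacher's feature dimension).
    teacher_tokens: per-token features from a frozen 3D model; treated as
                    constants, so no gradient would flow into the teacher.
    """
    if not student_tokens or len(student_tokens) != len(teacher_tokens):
        raise ValueError("expected two non-empty token lists of equal length")
    sims = [cosine_similarity(s, t) for s, t in zip(student_tokens, teacher_tokens)]
    return 1.0 - sum(sims) / len(sims)


if __name__ == "__main__":
    teacher = [[1.0, 0.0], [0.0, 2.0]]
    aligned = [[2.0, 0.0], [0.0, 1.0]]     # same directions, different scale
    misaligned = [[0.0, 1.0], [1.0, 0.0]]  # orthogonal to the teacher features
    print(alignment_loss(aligned, teacher))
    print(alignment_loss(misaligned, teacher))
```

Under this assumed form the loss ignores feature magnitude and depends only on direction, so it is zero when student and teacher features point the same way and grows as they diverge. In a real training setup such a term would typically be added to the diffusion objective with a weighting coefficient, which the abstract does not report.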