PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation

June 16, 2026
Authors: Yuhang Huang, Xuan Lv, Junyan Xu, Zhiyuan Yu, Jiazhao Zhang, Ruizhen Hu, Wancheng Feng, Shilong Zou, Hewen Xiao, Ziqiao Zhou, Kaiyun Huang, Zhiyu Peng, Juzhan Xu, Hang Zhao, Chenyang Zhu, Renjiao Yi, Yifei Huang, Douhui Wu, Yan Zhang, Kexu Cheng, Chunhe Song, Yunzhi Xue, Xiuhong Zhang, Leitao Guo, Yunji Chen, Bin Wu, Haibin Yu, Kai Xu
cs.AI

Abstract

World foundation models (WFMs) are powerful simulators, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required for robotic manipulation. While robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, current multi-view world models simply concatenate view tokens without explicit geometric reasoning. This causes cross-view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient. To address this, we present PAIWorld, a framework that augments diffusion-transformer world models via three core components: (1) Geometry-Aware Cross-View Attention blocks that establish an explicit pathway across views, (2) Geometric Rotary Position Embedding that encodes camera ray directions and extrinsic poses into the attention mechanism, and (3) Latent 3D-REPA, which distills 3D-aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT-based world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while enabling downstream applications such as model-based planning, world action models, and multi-view policy post-training.
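
The abstract names the components but gives no equations, so the following is only a minimal NumPy sketch of how per-pixel camera ray directions might be folded into attention across views through a rotary-style rotation. It is not the authors' implementation. The function names (`ray_directions`, `rotate_by_rays`, `cross_view_attention`) and the frequency matrix `freqs` are hypothetical, learned query/key/value projections are omitted, and only the rotation part of the extrinsic pose is used (translation is ignored).

```python
import numpy as np

def ray_directions(K, R_c2w, height, width):
    """Unit ray direction in world coordinates for each pixel centre."""
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # (H*W, 3)
    rays_cam = pix @ np.linalg.inv(K).T   # back-project through the intrinsics
    rays_world = rays_cam @ R_c2w.T       # rotate into the world frame
    return rays_world / np.linalg.norm(rays_world, axis=-1, keepdims=True)

def rotate_by_rays(x, rays, freqs):
    """Rotary-style encoding: rotate feature pairs by ray-derived angles.

    x: (N, D) with D even; rays: (N, 3); freqs: (D // 2, 3), an assumed
    (hypothetical) matrix mapping a ray direction to one angle per pair.
    """
    angles = rays @ freqs.T               # (N, D // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def cross_view_attention(tokens, rays, freqs):
    """tokens: (V, N, D), rays: (V, N, 3). Each token attends to all views."""
    V, N, D = tokens.shape
    flat = tokens.reshape(V * N, D)
    # Queries and keys share one rotated copy (no learned projections here).
    qk = rotate_by_rays(flat, rays.reshape(V * N, 3), freqs)
    scores = qk @ qk.T / np.sqrt(D)
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ flat).reshape(V, N, D)
```

Because each feature pair is rotated by an angle that is linear in the ray direction, the query-key dot product in this sketch depends on the difference between the two tokens' ray directions. That is the same relative-position property that standard rotary embeddings provide for sequence indices.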