Utonia: Toward One Encoder for All Point Clouds
March 3, 2026
Authors: Yujia Zhang, Xiaoyang Wu, Yunhan Yang, Xianzhe Fan, Han Li, Yuechen Zhang, Zehao Huang, Naiyan Wang, Hengshuang Zhao
cs.AI
Abstract
We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder across diverse domains, spanning remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Despite the distinct sensing geometries, densities, and priors of these domains, Utonia learns a consistent representation space that transfers across them. This unification improves perception capability while revealing intriguing emergent behaviors that arise only when domains are trained jointly. Beyond perception, we observe that Utonia representations can also benefit embodied and multimodal reasoning: conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating them into vision-language models yields gains in spatial reasoning. We hope Utonia can serve as a step toward foundation models for sparse 3D data, and support downstream applications in AR/VR, robotics, and autonomous driving.
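A practical prerequisite for training one encoder across remote sensing, LiDAR, RGB-D, CAD, and video-lifted point clouds is mapping them into a shared input format despite different scales, densities, and feature sets. The sketch below illustrates this idea only; the function name `to_unified`, the feature layout, and the voxel sizes are hypothetical and not taken from the paper.

```python
import numpy as np

# Illustrative preprocessing: center each cloud, voxel-downsample to tame
# density differences, and zero-pad missing color channels so every domain
# yields the same (M, 6) layout for a single shared encoder.
FEATURE_DIM = 6  # xyz + rgb; colorless domains (e.g. LiDAR) are zero-padded

def to_unified(points, colors=None, voxel=0.05):
    """Normalize an (N, 3) cloud into a unified (M, FEATURE_DIM) array."""
    pts = np.asarray(points, dtype=np.float32)
    pts = pts - pts.mean(axis=0)                     # center at origin
    keys = np.floor(pts / voxel).astype(np.int64)    # voxel grid keys
    _, idx = np.unique(keys, axis=0, return_index=True)  # one point per voxel
    order = np.sort(idx)
    pts = pts[order]
    rgb = (np.asarray(colors, dtype=np.float32)[order]
           if colors is not None else np.zeros_like(pts))
    return np.concatenate([pts, rgb], axis=1)        # (M, FEATURE_DIM)

# A dense indoor RGB-D cloud with color vs. a sparse outdoor LiDAR sweep
# without color end up in the same format:
rng = np.random.default_rng(0)
indoor = to_unified(rng.uniform(0, 1, (5000, 3)), rng.uniform(0, 1, (5000, 3)))
lidar = to_unified(rng.uniform(0, 50, (5000, 3)), voxel=0.5)
print(indoor.shape[1], lidar.shape[1])  # both FEATURE_DIM columns
```

A coarser voxel size for outdoor scenes reflects their larger spatial extent; the shared layout is what lets one transformer consume both without domain-specific input heads.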