Utonia: Toward One Encoder for All Point Clouds

March 3, 2026
Authors: Yujia Zhang, Xiaoyang Wu, Yunhan Yang, Xianzhe Fan, Han Li, Yuechen Zhang, Zehao Huang, Naiyan Wang, Hengshuang Zhao
cs.AI

Abstract

We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder across diverse domains, spanning remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Despite their distinct sensing geometries, densities, and priors, Utonia learns a consistent representation space that transfers across domains. This unification improves perception capability while revealing intriguing emergent behaviors that arise only when domains are trained jointly. Beyond perception, we observe that Utonia representations can also benefit embodied and multimodal reasoning: conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating them into vision-language models yields gains on spatial reasoning. We hope Utonia can serve as a step toward foundation models for sparse 3D data, and support downstream applications in AR/VR, robotics, and autonomous driving.