Utonia: Toward One Encoder for All Point Clouds

Abstract

We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder across diverse domains, spanning remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Although these domains differ in sensing geometry, density, and priors, Utonia learns a consistent representation space that transfers across them. This unification improves perception capability while revealing intriguing emergent behaviors that arise only when domains are trained jointly. Beyond perception, we observe that Utonia representations can also benefit embodied and multimodal reasoning: conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating them into vision-language models yields gains on spatial reasoning. We hope Utonia can serve as a step toward foundation models for sparse 3D data, and support downstream applications in AR/VR, robotics, and autonomous driving.
