UniT: Unified Geometry Learning with Group Autoregressive Transformer

Abstract

Recent feed-forward models have significantly advanced geometry perception, the task of inferring dense 3D structure from sensor observations. However, the essential capabilities of geometry perception remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. We present UniT, a unified model built upon a novel Group Autoregressive Transformer, which reformulates these seemingly disparate capabilities within a single framework. The key idea is to treat groups of sensor observations as the basic autoregressive units and predict the corresponding point maps in an anchor-free and scale-adaptive manner. More specifically, diverse view configurations in both online and offline settings are naturally unified within a single group autoregression process. By varying the group size, online mode operates over multiple autoregressive steps with single-frame groups, whereas offline mode aggregates a multi-frame group in a single forward pass. Meanwhile, a queue-style KV caching mechanism keeps autoregressive memory bounded over long horizons. This is enabled by anchor-free relational modeling, which reduces long-range dependencies on early frames and thereby allows outdated memory to be discarded on the fly. To improve metric-scale generalization across scenes, a scale-adaptive geometry loss is further introduced within this framework. It couples relative geometric constraints with a partial absolute scale term, implicitly regularizing global scale and inducing a progressive transition from scale-invariant geometry to metric-scale solutions. Together with a dedicated modal attention module for integrating auxiliary modalities, UniT achieves state-of-the-art performance in unified geometry perception, as validated on ten benchmarks spanning seven representative tasks.
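
As a rough illustration of how a single group autoregression loop could cover both modes while keeping memory bounded, the following minimal sketch partitions frames into groups and keeps a first-in, first-out cache of past groups. It is not the UniT implementation: the names `split_into_groups`, `QueueKVCache`, and `run_group_autoregression`, the `max_groups` parameter, and the use of frame labels in place of key/value tensors are all assumptions made for illustration. No attention or point-map prediction is performed.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple


def split_into_groups(frames: Sequence[str], group_size: int) -> List[List[str]]:
    """Partition frames into consecutive groups of at most `group_size` (hypothetical helper)."""
    if group_size < 1:
        raise ValueError("group_size must be >= 1")
    return [list(frames[i:i + group_size]) for i in range(0, len(frames), group_size)]


class QueueKVCache:
    """Hypothetical FIFO cache that retains entries for at most `max_groups` past groups."""

    def __init__(self, max_groups: int) -> None:
        # deque(maxlen=...) silently drops the oldest item once the limit is reached.
        self._queue: Deque[List[str]] = deque(maxlen=max_groups)

    def push(self, group_entries: List[str]) -> None:
        self._queue.append(group_entries)

    def context(self) -> List[str]:
        # Flatten the retained groups, oldest first.
        return [entry for group in self._queue for entry in group]


def run_group_autoregression(
    frames: Sequence[str], group_size: int, max_groups: int
) -> List[Tuple[List[str], List[str]]]:
    """Return, for each autoregressive step, the current group and the cached context it could see."""
    cache = QueueKVCache(max_groups)
    steps = []
    for group in split_into_groups(frames, group_size):
        # A real model would attend from `group` to the cached keys/values here
        # and predict point maps; this sketch only records what would be visible.
        steps.append((group, cache.context()))
        cache.push(group)
    return steps


if __name__ == "__main__":
    frames = ["f0", "f1", "f2", "f3"]
    # Online-style run: single-frame groups, several steps, cache capped at two groups.
    for group, context in run_group_autoregression(frames, group_size=1, max_groups=2):
        print("online ", group, "sees", context)
    # Offline-style run: one multi-frame group handled in a single step.
    for group, context in run_group_autoregression(frames, group_size=4, max_groups=2):
        print("offline", group, "sees", context)
```

In this sketch, setting `group_size=1` yields one step per frame with a context that never exceeds `max_groups` groups, while setting `group_size` to the sequence length yields a single step with an empty cache. Discarding the oldest group is only safe if later predictions do not depend on a fixed early reference frame, which is the role the abstract assigns to anchor-free relational modeling.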