UniT: Unified Geometry Learning with a Group Autoregressive Transformer

Abstract

Recent feed-forward models have significantly advanced geometry perception, the task of inferring dense 3D structure from sensor observations. However, the essential capabilities of geometry perception remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. We present UniT, a unified model built upon a novel Group Autoregressive Transformer, which reformulates these seemingly disparate capabilities within a single framework. The key idea is to treat groups of sensor observations as the basic autoregressive units and to predict the corresponding point maps in an anchor-free and scale-adaptive manner. More specifically, the diverse view configurations of online and offline settings are naturally unified within a single group autoregression process. By varying the group size, online mode runs multiple autoregressive steps over single-frame groups, whereas offline mode aggregates one multi-frame group in a single forward pass. Meanwhile, a queue-style KV caching mechanism keeps autoregressive memory bounded over long horizons. This bound is made possible by anchor-free relational modeling, which reduces long-range dependencies on early frames and thereby allows outdated memory to be discarded on the fly. To improve metric-scale generalization across scenes, we further introduce a scale-adaptive geometry loss. It couples relative geometric constraints with a partial absolute scale term, implicitly regularizing global scale and inducing a progressive transition from scale-invariant geometry to metric-scale solutions. Together with a dedicated modal attention module for integrating auxiliary modalities, UniT achieves state-of-the-art performance in unified geometry perception, as validated on ten benchmarks spanning seven representative tasks.
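
The relationship between group size, online and offline modes, and the queue-style cache can be illustrated with a small Python sketch. It is not the UniT implementation: frames are plain integers standing in for sensor observations, the cache stores whole groups rather than per-layer keys and values, and the names `split_into_groups`, `run_group_autoregression`, and `max_cached_groups` are hypothetical, introduced here only for illustration.

```python
from collections import deque
from typing import Deque, Iterator, List, Sequence, Tuple

Step = Tuple[List[int], List[List[int]]]


def split_into_groups(frames: Sequence[int], group_size: int) -> Iterator[List[int]]:
    """Yield consecutive groups of frames; the last group may be shorter."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    for start in range(0, len(frames), group_size):
        yield list(frames[start:start + group_size])


def run_group_autoregression(
    frames: Sequence[int], group_size: int, max_cached_groups: int
) -> List[Step]:
    """Return, for each autoregressive step, the current group and the
    cached groups that step could attend to (hypothetical stand-in for
    attention over cached keys and values)."""
    # A deque with maxlen behaves as a FIFO queue: appending to a full
    # deque silently drops the oldest entry, so memory stays bounded.
    cache: Deque[List[int]] = deque(maxlen=max_cached_groups)
    steps: List[Step] = []
    for group in split_into_groups(frames, group_size):
        visible = [list(cached) for cached in cache]
        steps.append((group, visible))
        cache.append(group)
    return steps


frames = list(range(6))

# Online mode (assumed setting): one frame per group, many steps.
online = run_group_autoregression(frames, group_size=1, max_cached_groups=2)

# Offline mode (assumed setting): all frames in one group, one step.
offline = run_group_autoregression(frames, group_size=len(frames), max_cached_groups=2)

for group, visible in online:
    print("online step", group, "sees cached groups", visible)
print("offline step", offline[0][0], "sees cached groups", offline[0][1])
```

With `group_size=1` the loop takes six steps and never holds more than two cached groups, while `group_size=len(frames)` collapses the same loop into a single step with an empty cache. Whether discarding old entries is safe in a real model depends on the anchor-free formulation described above, which this sketch does not model.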