UniT: Unified Geometry Learning with a Group Autoregressive Transformer

Abstract

Recent feed-forward models have significantly advanced geometry perception, the task of inferring dense 3D structure from sensor observations. However, the essential capabilities of geometry perception remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. We present UniT, a unified model built upon a novel Group Autoregressive Transformer, which brings these seemingly disparate capabilities into a single framework. The key idea is to treat groups of sensor observations as the basic autoregressive units and to predict the corresponding point maps in an anchor-free and scale-adaptive manner. Specifically, the diverse view configurations of online and offline settings are unified within a single group autoregression process. By varying the group size, online mode runs multiple autoregressive steps over single-frame groups, whereas offline mode aggregates one multi-frame group in a single forward pass. Meanwhile, a queue-style KV caching mechanism keeps autoregressive memory bounded over long horizons. This is possible because anchor-free relational modeling reduces long-range dependencies on early frames, so outdated memory can be discarded on the fly. To improve metric-scale generalization across scenes, we further introduce a scale-adaptive geometry loss. It couples relative geometric constraints with a partial absolute scale term, implicitly regularizing global scale and inducing a progressive transition from scale-invariant geometry to metric-scale solutions. Together with a dedicated modal attention module for integrating auxiliary modalities, UniT achieves state-of-the-art performance in unified geometry perception, as validated on ten benchmarks spanning seven representative tasks.
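
The interplay between group size and the queue-style cache can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `max_cached_groups` parameter, and the use of plain strings in place of frames are all assumptions made for illustration, and a real model would cache per-layer key/value tensors rather than the frames themselves. The sketch only shows how one loop could cover both modes and how a fixed-length queue bounds memory by evicting the oldest group.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple

Group = List[str]


def split_into_groups(frames: Sequence[str], group_size: int) -> List[Group]:
    """Partition a frame sequence into consecutive groups of at most group_size."""
    if group_size < 1:
        raise ValueError("group_size must be >= 1")
    return [list(frames[i:i + group_size]) for i in range(0, len(frames), group_size)]


def run_group_autoregression(
    frames: Sequence[str], group_size: int, max_cached_groups: int
) -> List[Tuple[Group, List[Group]]]:
    """Return, for each autoregressive step, the current group and the
    cached groups it may attend to (hypothetical stand-in for a KV cache)."""
    # A deque with maxlen drops its oldest entry automatically when full,
    # which keeps the cache size bounded regardless of sequence length.
    cache: Deque[Group] = deque(maxlen=max_cached_groups)
    steps = []
    for group in split_into_groups(frames, group_size):
        steps.append((group, list(cache)))  # snapshot of the visible context
        cache.append(group)
    return steps


frames = ["f0", "f1", "f2", "f3"]

# Online mode (assumed setting): single-frame groups, several steps.
online = run_group_autoregression(frames, group_size=1, max_cached_groups=2)

# Offline mode (assumed setting): one multi-frame group, a single step.
offline = run_group_autoregression(frames, group_size=len(frames), max_cached_groups=2)

print(len(online), online[-1])
print(len(offline), offline[0])
```

In the online run, the last step sees only the two most recent groups, since `f0` has already been evicted from the queue. In the offline run there is a single step with an empty cache, because all frames are processed together as one group.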