UniT: Unified Geometry Learning with a Group Autoregressive Transformer

Abstract

Recent feed-forward models have significantly advanced geometry perception, the task of inferring dense 3D structure from sensor observations. However, the essential capabilities of these models remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. We present UniT, a unified model built upon a novel Group Autoregressive Transformer, which reformulates these seemingly disparate capabilities within a single framework. The key idea is to treat groups of sensor observations as the basic autoregressive units and to predict the corresponding point maps in an anchor-free and scale-adaptive manner. Specifically, the diverse view configurations of online and offline settings are unified within a single group autoregression process. By varying the group size, online mode operates over multiple autoregressive steps with single-frame groups, whereas offline mode aggregates one multi-frame group in a single forward pass. Meanwhile, a queue-style KV caching mechanism keeps autoregressive memory bounded over long horizons. This is possible because anchor-free relational modeling reduces long-range dependencies on early frames, so outdated memory can be discarded on the fly. To improve metric-scale generalization across scenes, we further introduce a scale-adaptive geometry loss within this framework. It couples relative geometric constraints with a partial absolute scale term, implicitly regularizing the global scale and inducing a progressive transition from scale-invariant geometry to metric-scale solutions. Together with a dedicated modal attention module for integrating auxiliary modalities, UniT achieves state-of-the-art performance in unified geometry perception, as validated on ten benchmarks spanning seven representative tasks.
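
The group autoregression schedule and the queue-style cache described above can be illustrated with a toy sketch. It is not the authors' implementation: the names `make_groups`, `QueueKVCache`, and `run_group_autoregression`, and the parameter `max_cached_groups`, are hypothetical, and the cache stores the raw frame groups as stand-ins for the keys and values a transformer would actually compute. The sketch only shows how one loop might cover both modes by changing the group size, and how a first-in, first-out queue keeps the cached context bounded.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def make_groups(frames: Sequence[T], group_size: int) -> List[List[T]]:
    """Split a frame sequence into consecutive groups (the autoregressive units)."""
    if group_size < 1:
        raise ValueError("group_size must be >= 1")
    return [list(frames[i:i + group_size]) for i in range(0, len(frames), group_size)]


class QueueKVCache:
    """Hypothetical FIFO cache that keeps entries for at most `max_groups` past groups."""

    def __init__(self, max_groups: int) -> None:
        # A deque with maxlen silently drops the oldest entry when it is full.
        self._entries: Deque[List] = deque(maxlen=max_groups)

    def append(self, entry: List) -> None:
        self._entries.append(entry)

    def context(self) -> List[List]:
        """Return the cached entries, oldest first."""
        return list(self._entries)


def run_group_autoregression(
    frames: Sequence[T], group_size: int, max_cached_groups: int
) -> List[Tuple[List[T], List[List[T]]]]:
    """For each step, record the current group and the cached context it can see."""
    cache = QueueKVCache(max_cached_groups)
    trace = []
    for group in make_groups(frames, group_size):
        trace.append((group, cache.context()))
        # Assumption: the group itself stands in for its keys and values.
        cache.append(group)
    return trace


if __name__ == "__main__":
    frames = list(range(6))
    # Online mode: single-frame groups, many steps, bounded context.
    for group, ctx in run_group_autoregression(frames, group_size=1, max_cached_groups=2):
        print("online ", group, ctx)
    # Offline mode: one multi-frame group, one step, empty context.
    for group, ctx in run_group_autoregression(frames, group_size=6, max_cached_groups=2):
        print("offline", group, ctx)
```

With six frames, the online call takes six steps and never sees more than two cached groups, while the offline call takes a single step over all six frames. Discarding the oldest entries is only safe under the anchor-free assumption stated in the abstract, namely that predictions do not depend on the earliest frames.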