
IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction

October 26, 2025
Authors: Hao Li, Zhengyu Zou, Fangfu Liu, Xuanyang Zhang, Fangzhou Hong, Yukang Cao, Yushi Lan, Manyuan Zhang, Gang Yu, Dingwen Zhang, Ziwei Liu
cs.AI

Abstract

Humans naturally perceive the geometric structure and semantic content of the 3D world as intertwined dimensions, enabling coherent and accurate understanding of complex scenes. However, most prior approaches prioritize training large geometry models for low-level 3D reconstruction and treat high-level spatial understanding in isolation. This overlooks the crucial interplay between these two fundamental aspects of 3D scene analysis, which limits generalization and leads to poor performance on downstream 3D understanding tasks. Recent attempts try to mitigate this issue by simply aligning 3D models with specific language models, but this restricts perception to the aligned model's capacity and limits adaptability to downstream tasks. In this paper, we propose the Instance-Grounded Geometry Transformer (IGGT), an end-to-end large unified transformer that unifies knowledge for both spatial reconstruction and instance-level contextual understanding. Specifically, we design a 3D-Consistent Contrastive Learning strategy that guides IGGT to encode a unified representation with geometric structures and instance-grounded clustering from 2D visual inputs alone. This representation supports consistent lifting of 2D visual inputs into a coherent 3D scene with explicitly distinct object instances. To facilitate this task, we further construct InsScene-15K, a large-scale dataset with high-quality RGB images, poses, depth maps, and 3D-consistent instance-level mask annotations, produced with a novel data curation pipeline.
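
The abstract does not spell out the loss behind the 3D-Consistent Contrastive Learning strategy. As a rough illustration of the general idea only, the sketch below uses a generic pull-push objective over pixel embeddings pooled from several views, where the same instance ID marks the same object in every view. The function name `instance_contrastive_loss`, the `margin` parameter, and the specific pull and push terms are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def instance_contrastive_loss(features, instance_ids, margin=1.0):
    """Generic pull-push loss (illustrative; not the paper's formulation).

    features: (N, D) per-pixel embeddings sampled from all views together.
    instance_ids: (N,) integer labels; one ID per object, shared across views.
    """
    features = np.asarray(features, dtype=np.float64)
    instance_ids = np.asarray(instance_ids)
    ids, inverse = np.unique(instance_ids, return_inverse=True)

    # One centroid per instance, computed over every view at once.
    centroids = np.stack(
        [features[inverse == k].mean(axis=0) for k in range(len(ids))]
    )

    # Pull term: mean squared distance of each embedding to its own centroid.
    pull = np.mean(np.sum((features - centroids[inverse]) ** 2, axis=1))

    # Push term: squared hinge on centroid pairs closer than `margin`.
    push = 0.0
    if len(ids) > 1:
        dists = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
        upper = np.triu_indices(len(ids), k=1)
        push = np.mean(np.maximum(0.0, margin - dists[upper]) ** 2)

    return float(pull + push)
```

Under these assumptions, embeddings of the same object are drawn together regardless of which view they come from, while different objects are kept at least `margin` apart, so a simple clustering of the embeddings would recover instances that stay consistent across views.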