IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction

October 26, 2025
Authors: Hao Li, Zhengyu Zou, Fangfu Liu, Xuanyang Zhang, Fangzhou Hong, Yukang Cao, Yushi Lan, Manyuan Zhang, Gang Yu, Dingwen Zhang, Ziwei Liu
cs.AI

Abstract

Humans naturally perceive the geometric structure and semantic content of the 3D world as intertwined dimensions, which enables a coherent and accurate understanding of complex scenes. However, most prior approaches prioritize training large geometry models for low-level 3D reconstruction and treat high-level spatial understanding in isolation. This overlooks the crucial interplay between these two fundamental aspects of 3D scene analysis, which limits generalization and leads to poor performance on downstream 3D understanding tasks. Recent attempts mitigate this issue by simply aligning 3D models with specific language models, but doing so restricts perception to the aligned model's capacity and limits adaptability to downstream tasks. In this paper, we propose the Instance-Grounded Geometry Transformer (IGGT), an end-to-end large unified transformer that unifies the knowledge for both spatial reconstruction and instance-level contextual understanding. Specifically, we design a 3D-Consistent Contrastive Learning strategy that guides IGGT to encode a unified representation with geometric structures and instance-grounded clustering from only 2D visual inputs. This representation supports consistent lifting of 2D visual inputs into a coherent 3D scene with explicitly distinct object instances. To facilitate this task, we further construct InsScene-15K, a large-scale dataset built with a novel data curation pipeline that provides high-quality RGB images, poses, depth maps, and 3D-consistent instance-level mask annotations.
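
The abstract does not spell out the 3D-Consistent Contrastive Learning objective, so the following is only a rough illustration of the general idea, not the authors' formulation. It uses a generic margin-based contrastive loss: features that carry the same 3D-consistent instance ID (possibly sampled from different views) are pulled together, and features with different IDs are pushed at least a margin apart. The function name `pairwise_contrastive_loss`, the `margin` parameter, and the toy feature vectors are all hypothetical.

```python
import math
from itertools import combinations


def pairwise_contrastive_loss(features, instance_ids, margin=1.0):
    """Generic margin-based contrastive loss over all feature pairs.

    This is an illustrative stand-in, NOT the loss used in the IGGT paper.

    features:     list of equal-length float vectors, e.g. per-pixel features
                  gathered from several views of the same scene.
    instance_ids: list of ints; the same ID marks the same object instance
                  across views (the "3D-consistent" labels).
    margin:       hypothetical minimum distance between different instances.
    """
    if len(features) != len(instance_ids):
        raise ValueError("features and instance_ids must have the same length")

    total, count = 0.0, 0
    for i, j in combinations(range(len(features)), 2):
        dist = math.dist(features[i], features[j])  # Euclidean distance
        if instance_ids[i] == instance_ids[j]:
            # Same instance: penalize any separation (pull together).
            total += dist ** 2
        else:
            # Different instances: penalize only if closer than the margin.
            total += max(0.0, margin - dist) ** 2
        count += 1

    # Average over pairs; an empty or single-element input has no pairs.
    return total / count if count else 0.0


# Toy usage: two features of one object (ID 1) seen from two views,
# plus one feature of a different object (ID 2).
toy_features = [[0.0, 0.0], [0.1, 0.0], [2.0, 0.0]]
toy_ids = [1, 1, 2]
loss = pairwise_contrastive_loss(toy_features, toy_ids)
```

Under this kind of objective, features of one instance collapse into a tight cluster regardless of viewpoint, which is one plausible reading of the "instance-grounded clustering" mentioned in the abstract.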