

Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection

March 23, 2026
作者: Youbin Kim, Jinho Park, Hogun Park, Eunbyung Park
cs.AI

Abstract

Open-vocabulary 3D object detection aims to localize and recognize objects beyond a fixed training taxonomy. In multi-view RGB settings, recent approaches often decouple geometry-based instance construction from semantic labeling, generating class-agnostic fragments and assigning open-vocabulary categories post hoc. While flexible, such decoupling leaves instance construction governed primarily by geometric consistency, without semantic constraints during merging. When geometric evidence is view-dependent and incomplete, this geometry-only merging can lead to irreversible association errors, including over-merging of distinct objects or fragmentation of a single instance. We propose Group3D, a multi-view open-vocabulary 3D detection framework that integrates semantic constraints directly into the instance construction process. Group3D maintains a scene-adaptive vocabulary derived from a multimodal large language model (MLLM) and organizes it into semantic compatibility groups that encode plausible cross-view category equivalence. These groups act as merge-time constraints: 3D fragments are associated only when they satisfy both semantic compatibility and geometric consistency. This semantically gated merging mitigates geometry-driven over-merging while absorbing multi-view category variability. Group3D supports both pose-known and pose-free settings, relying only on RGB observations. Experiments on ScanNet and ARKitScenes demonstrate that Group3D achieves state-of-the-art performance in multi-view open-vocabulary 3D detection, while exhibiting strong generalization in zero-shot scenarios. The project page is available at https://ubin108.github.io/Group3D/.
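The semantically gated merging described above can be sketched as a union-find clustering over 3D fragments, where two fragments are linked only if they pass both a semantic-compatibility check and a geometric-consistency check. The following is a minimal illustrative sketch, not the authors' implementation: the `Fragment` structure, the axis-aligned-box IoU criterion, and the threshold are all assumptions standing in for the paper's actual geometric-consistency test.

```python
from dataclasses import dataclass

# Hypothetical sketch of semantically gated merging. Fragment pairs are
# merged only when they pass BOTH a geometric check (box IoU) and a
# semantic check (labels share a compatibility group). All names and
# thresholds are illustrative, not the paper's implementation.

@dataclass
class Fragment:
    fid: int
    box: tuple   # axis-aligned 3D box: (x1, y1, z1, x2, y2, z2)
    label: str   # open-vocabulary category from the scene-adaptive vocabulary

def iou_3d(a, b):
    """Intersection-over-union of two axis-aligned 3D boxes."""
    ix = max(0.0, min(a[3], b[3]) - max(a[0], b[0]))
    iy = max(0.0, min(a[4], b[4]) - max(a[1], b[1]))
    iz = max(0.0, min(a[5], b[5]) - max(a[2], b[2]))
    inter = ix * iy * iz
    vol = lambda c: (c[3] - c[0]) * (c[4] - c[1]) * (c[5] - c[2])
    union = vol(a) + vol(b) - inter
    return inter / union if union > 0 else 0.0

def semantically_gated_merge(fragments, groups, iou_thr=0.25):
    """Union-find merge: link two fragments only when their labels share
    a semantic-compatibility group AND their boxes overlap enough."""
    # Map each label to the set of group indices containing it.
    label2groups = {}
    for gi, g in enumerate(groups):
        for lbl in g:
            label2groups.setdefault(lbl, set()).add(gi)

    parent = {f.fid: f.fid for f in fragments}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, a in enumerate(fragments):
        for b in fragments[i + 1:]:
            shared = (label2groups.get(a.label, set())
                      & label2groups.get(b.label, set()))
            if shared and iou_3d(a.box, b.box) >= iou_thr:
                parent[find(a.fid)] = find(b.fid)

    # Collect merged instances as sorted fragment-id clusters.
    clusters = {}
    for f in fragments:
        clusters.setdefault(find(f.fid), []).append(f.fid)
    return sorted(sorted(c) for c in clusters.values())
```

With a compatibility group such as `{"sofa", "couch"}`, two heavily overlapping fragments labeled "sofa" and "couch" merge into one instance, while an overlapping "chair" fragment is kept separate despite the geometric overlap — illustrating how the semantic gate blocks geometry-only over-merging.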