OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding
January 14, 2026
Authors: Sheng-Yu Huang, Jaesung Choe, Yu-Chiang Frank Wang, Cheng Sun
cs.AI
Abstract
We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given a sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, OpenVoxel produces meaningful groups that describe the different objects in the scene. By leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel then builds an informative scene map by captioning each group, enabling downstream 3D scene understanding tasks such as open-vocabulary segmentation (OVS) and referring expression segmentation (RES). Unlike previous methods, ours requires no training and does not rely on embeddings from a CLIP/BERT text encoder; instead, it performs text-to-text retrieval directly with MLLMs. Extensive experiments demonstrate superior performance compared to recent studies, particularly on complex RES tasks. The code will be open-sourced.
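To make the described pipeline concrete, below is a minimal sketch of the two stages the abstract outlines: captioning voxel groups to build a scene map, then resolving a referring expression by text-to-text retrieval with an MLLM rather than by embedding similarity. All helper callables (group_voxels, render_group, mllm_caption, mllm_select) and the VoxelGroup container are hypothetical placeholders for illustration, not the authors' released API.

```python
# Sketch of an OpenVoxel-style pipeline under the assumptions stated above.
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoxelGroup:
    voxel_ids: list[int]   # indices into the sparse voxel grid (hypothetical layout)
    caption: str = ""      # MLLM-generated description of the grouped object

def build_scene_map(
    svr_model,                            # sparse voxel rasterization model
    group_voxels: Callable,               # hypothetical: training-free voxel grouping
    render_group: Callable,               # hypothetical: renders one group to an image
    mllm_caption: Callable[..., str],     # hypothetical: image -> caption via an MLLM
) -> list[VoxelGroup]:
    """Group the sparse voxels, then caption each group to form the scene map."""
    groups = [VoxelGroup(ids) for ids in group_voxels(svr_model)]
    for g in groups:
        image = render_group(svr_model, g.voxel_ids)
        g.caption = mllm_caption(image)   # e.g. "a red armchair by the window"
    return groups

def referring_segmentation(
    scene_map: list[VoxelGroup],
    query: str,
    mllm_select: Callable[[str, list[str]], int],  # hypothetical text-to-text retrieval
) -> VoxelGroup:
    """Resolve a referring expression by asking the MLLM which stored caption
    best matches the query, instead of comparing CLIP/BERT text embeddings."""
    captions = [g.caption for g in scene_map]
    best = mllm_select(query, captions)   # index of the best-matching caption
    return scene_map[best]
```

The key design point this sketch reflects is that the query never leaves text space: retrieval is a caption-matching question posed to the MLLM, which is what lets the method remain training-free and independent of any learned joint embedding.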