

OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding

January 14, 2026
Authors: Sheng-Yu Huang, Jaesung Choe, Yu-Chiang Frank Wang, Cheng Sun
cs.AI

Abstract

We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given a sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, OpenVoxel produces meaningful groups that describe the different objects in the scene. By leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel then builds an informative scene map by captioning each group, enabling downstream 3D scene understanding tasks such as open-vocabulary segmentation (OVS) and referring expression segmentation (RES). Unlike previous methods, our approach requires no training and does not rely on embeddings from a CLIP/BERT text encoder; instead, it performs text-to-text retrieval directly with MLLMs. Extensive experiments show that our method outperforms recent studies, particularly on complex RES tasks. The code will be released.
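To make the text-to-text retrieval idea concrete, the sketch below shows one possible way to query a captioned scene map with an MLLM instead of comparing CLIP/BERT embeddings. This is not the authors' code: the scene map, the prompt format, and the `query_mllm` callable are all hypothetical placeholders standing in for whatever grouping, captioning, and MLLM interface OpenVoxel actually uses.

```python
# Minimal sketch (assumed, not the authors' implementation): text-to-text
# retrieval over a captioned voxel-group scene map.
from typing import Callable, Dict


def build_prompt(query: str, scene_map: Dict[int, str]) -> str:
    """Ask the MLLM which captioned group best matches a referring expression."""
    listing = "\n".join(f"{gid}: {caption}" for gid, caption in scene_map.items())
    return (
        "Scene groups and their captions:\n"
        f"{listing}\n\n"
        f"Referring expression: \"{query}\"\n"
        "Answer with the single group id that best matches."
    )


def retrieve_group(query: str,
                   scene_map: Dict[int, str],
                   query_mllm: Callable[[str], str]) -> int:
    """Text-to-text search: no text-encoder embeddings, only the MLLM's reply."""
    answer = query_mllm(build_prompt(query, scene_map))
    # Take the first integer token in the reply; -1 if the reply has none.
    for token in answer.split():
        if token.strip(".,").isdigit():
            return int(token.strip(".,"))
    return -1


if __name__ == "__main__":
    # Toy scene map: group id -> caption produced offline by a VLM/MLLM.
    scene_map = {0: "a red armchair near the window", 1: "a wooden dining table"}
    fake_mllm = lambda prompt: "0"  # stand-in for a real MLLM call
    print(retrieve_group("the chair by the window", scene_map, fake_mllm))
```

The returned group id would then index back into the voxel grouping to produce the segmentation mask for the query.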