OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding
January 14, 2026
Authors: Sheng-Yu Huang, Jaesung Choe, Yu-Chiang Frank Wang, Cheng Sun
cs.AI
Abstract
We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given a sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, OpenVoxel produces meaningful groups that describe the different objects in the scene. By leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel then builds an informative scene map by captioning each group, enabling downstream 3D scene understanding tasks such as open-vocabulary segmentation (OVS) and referring expression segmentation (RES). Unlike previous methods, ours requires no training and does not rely on embeddings from a CLIP/BERT text encoder; instead, it performs text-to-text search directly with MLLMs. Extensive experiments show that our method outperforms recent studies, particularly on complex referring expression segmentation (RES) tasks. The code will be released.
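To make the text-to-text search idea concrete, below is a minimal Python sketch of matching a referring expression against a captioned scene map by prompting an MLLM rather than comparing text-encoder embeddings. Everything here (`SceneMap`, `query_mllm`, the prompt format) is a hypothetical stand-in for illustration, not the paper's actual interface.

```python
# Sketch: text-to-text search over a captioned scene map via an MLLM prompt.
# All names are hypothetical; plug in a real MLLM backend in query_mllm().

from dataclasses import dataclass, field


@dataclass
class SceneMap:
    """Maps each voxel-group id to the caption generated for that group."""
    captions: dict[int, str] = field(default_factory=dict)


def query_mllm(prompt: str) -> str:
    """Stand-in for an MLLM call; a real system would query the model here."""
    raise NotImplementedError("connect an MLLM backend")


def text_to_text_search(scene_map: SceneMap, query: str) -> int:
    """Ask the MLLM which captioned group best matches a referring expression,
    instead of ranking groups by CLIP/BERT embedding similarity."""
    listing = "\n".join(f"[{gid}] {cap}" for gid, cap in scene_map.captions.items())
    prompt = (
        "Below are captions of object groups in a 3D scene:\n"
        f"{listing}\n\n"
        f'Which group id best matches: "{query}"? Answer with the id only.'
    )
    answer = query_mllm(prompt)
    return int(answer.strip())
```

The returned group id would then select the corresponding set of sparse voxels as the segmentation result, which is how a caption-level match can be lifted back to 3D geometry.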