

OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding

January 14, 2026
作者: Sheng-Yu Huang, Jaesung Choe, Yu-Chiang Frank Wang, Cheng Sun
cs.AI

Abstract

We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given a sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, OpenVoxel produces meaningful groups that describe the different objects in the scene. By leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel then captions each group to build an informative scene map, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) and referring expression segmentation (RES). Unlike previous methods, ours is training-free and does not rely on embeddings from a CLIP/BERT text encoder; instead, it performs text-to-text search directly with MLLMs. Extensive experiments show that our method outperforms recent studies, particularly on complex referring expression segmentation (RES) tasks. The code will be released publicly.
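The text-to-text search step lends itself to a short illustration. The sketch below is a minimal, hypothetical rendering of that idea, not the authors' implementation: the `VoxelGroup` structure, the `mllm_query` placeholder, and the prompt format are all assumptions. It shows how a referring expression can be matched against per-group captions purely in text space, with an MLLM acting as the matcher in place of a CLIP/BERT embedding comparison.

```python
# Hypothetical sketch of the text-to-text search described in the abstract.
# All names here (VoxelGroup, mllm_query) are illustrative assumptions, not
# the authors' API: the referring expression is matched against per-group
# captions directly by an MLLM, with no text-encoder embeddings involved.

from dataclasses import dataclass

@dataclass
class VoxelGroup:
    group_id: int
    voxel_ids: list[int]   # sparse-voxel indices belonging to this object
    caption: str           # description produced by the VLM/MLLM


def mllm_query(prompt: str) -> str:
    """Placeholder for a call to any chat-style MLLM backend."""
    raise NotImplementedError("wire up your MLLM client here")


def text_to_text_search(scene_map: list[VoxelGroup], expression: str) -> VoxelGroup:
    """Pick the captioned group that best matches a referring expression."""
    catalog = "\n".join(f"{g.group_id}: {g.caption}" for g in scene_map)
    prompt = (
        "Given these object descriptions from a 3D scene:\n"
        f"{catalog}\n"
        f"Which single ID best matches: '{expression}'? Answer with the ID only."
    )
    answer = mllm_query(prompt).strip()
    chosen = int(answer)  # assumes the MLLM follows the answer format
    return next(g for g in scene_map if g.group_id == chosen)
```

Under this reading, the scene map is queried entirely in natural language, which is what lets the pipeline handle complex referring expressions without any task-specific training.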