

UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation

March 24, 2026
Authors: Jiaying Lin, Dan Xu
cs.AI

Abstract

Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these methods are limited by single-scale, passive, and heuristic frame selection. We present UniFunc3D, a unified and training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D grounds task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy, allowing the model to adaptively select the correct video frames and to focus on high-detail interactive parts while preserving the global context needed for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin (a relative 59.9% mIoU improvement) without any task-specific training. Code will be released on our project page: https://jiaying.link/unifunc3d.
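The abstract gives no implementation details, but the coarse-to-fine, training-free loop it describes can be sketched. The snippet below is a hypothetical illustration, not the authors' released code: `query_mllm`, the frame format, and the prompts are stand-ins for whatever MLLM interface the paper actually uses, and the two calls shown here separate for clarity what the paper consolidates into a single forward pass.

```python
from typing import List, Tuple

def query_mllm(prompt: str, images: List) -> str:
    """Stub for a multimodal-LLM call. A real pipeline would send the prompt
    and images to a vision-language model and parse its text reply."""
    raise NotImplementedError("plug in an actual MLLM client here")

def active_ground(instruction: str,
                  frames: List) -> Tuple[int, Tuple[int, int, int, int]]:
    """Coarse-to-fine active grounding: pick a frame, then box the part."""
    # Coarse (temporal) pass: let the model scan previews of all frames and
    # name the one where the target functionality is visible.
    reply = query_mllm(
        prompt=f"For the task '{instruction}', which frame index "
               f"(0-{len(frames) - 1}) shows the relevant interactive element?",
        images=frames,  # downsampled previews in a real pipeline
    )
    frame_idx = int(reply)

    # Fine (spatial) pass: re-query on the chosen full-resolution frame for a
    # tight box around the interactive part; keeping the whole frame in view
    # preserves the global context needed to disambiguate similar elements.
    reply = query_mllm(
        prompt=f"Return x0,y0,x1,y1 of the element to manipulate for "
               f"'{instruction}'.",
        images=[frames[frame_idx]],
    )
    x0, y0, x1, y1 = (int(v) for v in reply.split(","))
    return frame_idx, (x0, y0, x1, y1)
```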