

UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation

March 24, 2026
Authors: Jiaying Lin, Dan Xu
cs.AI

Abstract

Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing, and we observe that they are further limited by single-scale, passive, and heuristic frame selection. We present UniFunc3D, a unified, training-free framework that treats a multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D performs joint reasoning that grounds task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy, allowing the model to adaptively select the correct video frames and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin (a relative 59.9% mIoU improvement) without any task-specific training. Code will be released on our project page: https://jiaying.link/unifunc3d.
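To make the coarse-to-fine active grounding idea concrete, the sketch below shows one plausible structure for such a loop: a coarse pass where the MLLM ranks downsampled frames by relevance to the instruction, followed by a fine pass where it localizes the interactive part on a high-resolution crop while the full frame stays in the prompt as global context. This is a minimal illustration under assumed interfaces, not the authors' implementation; `query_mllm`, the prompts, and the frame format are hypothetical placeholders.

```python
def query_mllm(images, prompt):
    """Hypothetical placeholder for a multimodal LLM call.

    A real system would send the images and prompt to an actual MLLM
    client and parse its structured answer (indices, a box, or a mask,
    depending on the prompt). Nothing here is a real API.
    """
    raise NotImplementedError("plug in an actual MLLM client here")


def coarse_to_fine_grounding(frames, instruction, top_k=4):
    """Sketch of active spatial-temporal grounding (assumed design).

    `frames` is assumed to be a list of H x W x 3 image arrays from the
    scene video. Coarse pass: the MLLM, acting as an active observer,
    selects the frames most relevant to the implicit instruction from
    cheap thumbnails. Fine pass: it then inspects a high-resolution crop
    of each selected frame to segment the fine-grained interactive part,
    with the full frame kept alongside the crop for disambiguation.
    """
    # Coarse pass: downsample every frame and ask the model which
    # frames best show the part needed to carry out the instruction.
    thumbs = [f[::8, ::8] for f in frames]  # cheap 8x spatial downsample
    picked = query_mllm(
        thumbs,
        f"Instruction: {instruction}\n"
        f"Return the indices of the {top_k} frames that best show "
        "the object part needed to carry out this instruction.",
    )

    # Fine pass: box the interactive element in each selected frame,
    # zoom into that crop, and segment it while passing the full frame
    # too so global context is preserved for disambiguation.
    results = []
    for idx in picked:
        frame = frames[idx]
        x0, y0, x1, y1 = query_mllm(
            [frame], f"Box the part to interact with for: {instruction}"
        )
        crop = frame[y0:y1, x0:x1]
        mask = query_mllm(
            [frame, crop],
            "Segment the interactive part visible in the zoomed crop; "
            "use the full frame to resolve any ambiguity.",
        )
        results.append((idx, mask))
    return results
```

The two-stage structure mirrors the abstract's claim: cheap thumbnails keep the coarse pass tractable over many frames, while the zoomed crop plus full-frame pairing is one natural way to focus on high-detail parts without losing the global context needed for disambiguation.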