UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation

Abstract

Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these methods are limited to single-scale, passive, and heuristic frame selection. We present UniFunc3D, a unified and training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D performs joint reasoning to ground task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy. This allows the model to adaptively select the correct video frames and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On the SceneFun3D benchmark, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin with a relative 59.9% mIoU improvement, without any task-specific training. Code will be released on our project page: https://jiaying.link/unifunc3d
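The abstract reports a relative, rather than absolute, mIoU improvement. The sketch below illustrates the difference under stated assumptions: each mask is represented as a set of 3D point indices, mIoU is a plain average of per-instance IoU, and the helper names (`iou`, `mean_iou`, `relative_improvement`) and the numbers in the usage lines are hypothetical. None of them come from the paper or from the SceneFun3D evaluation code, whose exact protocol may differ.

```python
def iou(pred, gt):
    """Intersection-over-union of two masks given as collections of point indices."""
    pred, gt = set(pred), set(gt)
    union = pred | gt
    if not union:
        # Both masks empty: treated as a perfect match (a convention assumed here).
        return 1.0
    return len(pred & gt) / len(union)


def mean_iou(pairs):
    """Average IoU over (predicted_mask, ground_truth_mask) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (pred, gt) pair")
    return sum(iou(p, g) for p, g in pairs) / len(pairs)


def relative_improvement(baseline, ours):
    """Relative gain in percent: (ours - baseline) / baseline * 100."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (ours - baseline) / baseline * 100.0


# Illustrative placeholder values only, not results from the paper.
example_pairs = [({1, 2, 3}, {2, 3, 4}), ({5, 6}, {5, 6})]
example_miou = mean_iou(example_pairs)            # (0.5 + 1.0) / 2
example_gain = relative_improvement(10.0, 15.0)   # 5 points absolute, 50% relative
```

With these placeholder numbers, a 5-point absolute gain over a baseline of 10.0 corresponds to a 50% relative gain, which is why a relative figure can look much larger than the absolute difference in mIoU points.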
