UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation

Abstract

Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these methods are limited by single-scale, passive, and heuristic frame selection. We present UniFunc3D, a unified and training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D performs joint reasoning to ground task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy. This allows the model to select the correct video frames adaptively and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin with a relative 59.9% mIoU improvement, without any task-specific training. Code will be released on our project page: https://jiaying.link/unifunc3d.
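To make the reported metric concrete, the following is a minimal sketch of how a mean IoU over point masks and a relative improvement between two scores can be computed. The function names (`mask_iou`, `mean_iou`, `relative_improvement`), the empty-mask convention, and the toy inputs in the usage note are assumptions for illustration only; they are not taken from the UniFunc3D code or the SceneFun3D evaluation protocol, which may match predictions to ground truth differently.

```python
def mask_iou(pred, gt):
    """Intersection-over-union of two equal-length boolean point masks."""
    if len(pred) != len(gt):
        raise ValueError("masks must cover the same points")
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return inter / union


def mean_iou(preds, gts):
    """Average IoU over paired (prediction, ground-truth) masks."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("need a non-empty, equally sized list of mask pairs")
    return sum(mask_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


def relative_improvement(new_score, baseline_score):
    """Relative gain of new_score over baseline_score, as a fraction."""
    if baseline_score == 0:
        raise ValueError("baseline score must be non-zero")
    return (new_score - baseline_score) / baseline_score
```

With toy inputs, `mask_iou([1, 1, 0, 0], [1, 0, 0, 0])` gives 0.5, since one point is shared out of two in the union. A relative improvement is a ratio of scores rather than a difference in percentage points, so the 59.9% figure describes the gain divided by the baseline mIoU.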
