Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields

Abstract

Recent advancements in 2D and multimodal models have achieved remarkable success by leveraging large-scale training on extensive datasets. However, extending these achievements to enable free-form interactions and high-level semantic operations with complex 3D/4D scenes remains challenging. This difficulty stems from the limited availability of large-scale, annotated 3D/4D or multi-view datasets, which are crucial for generalizable vision and language tasks such as open-vocabulary and prompt-based segmentation, language-guided editing, and visual question answering (VQA). In this paper, we introduce Feature4X, a universal framework designed to extend any functionality from 2D vision foundation models into the 4D realm, using only monocular video input, which is widely available from user-generated content. The "X" in Feature4X represents its versatility, enabling any task through adaptable, model-conditioned 4D feature field distillation. At the core of our framework is a dynamic optimization strategy that unifies multiple model capabilities into a single representation. Additionally, to the best of our knowledge, Feature4X is the first method to distill and lift the features of video foundation models (e.g., SAM2, InternVideo2) into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel-view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps, empowered by LLMs in feedback loops. These advancements broaden the scope of agentic AI applications by providing a foundation for scalable, contextually and spatiotemporally aware systems capable of immersive dynamic 4D scene interaction.
