Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields

Abstract

Recent advancements in 2D and multimodal models have achieved remarkable success by leveraging large-scale training on extensive datasets. However, extending these achievements to enable free-form interactions and high-level semantic operations with complex 3D/4D scenes remains challenging. This difficulty stems from the limited availability of large-scale, annotated 3D/4D or multi-view datasets, which are crucial for generalizable vision and language tasks such as open-vocabulary and prompt-based segmentation, language-guided editing, and visual question answering (VQA). In this paper, we introduce Feature4X, a universal framework designed to extend any functionality of 2D vision foundation models into the 4D realm, using only monocular video input, which is widely available from user-generated content. The "X" in Feature4X represents its versatility, enabling any task through adaptable, model-conditioned 4D feature field distillation. At the core of our framework is a dynamic optimization strategy that unifies multiple model capabilities into a single representation. Additionally, to the best of our knowledge, Feature4X is the first method to distill and lift the features of video foundation models (e.g., SAM2, InternVideo2) into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel-view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps, empowered by LLMs in feedback loops. These advancements broaden the scope of agentic AI applications by providing a foundation for scalable, contextually and spatiotemporally aware systems capable of immersive dynamic 4D scene interaction.
