Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields
March 26, 2025
Authors: Shijie Zhou, Hui Ren, Yijia Weng, Shuwang Zhang, Zhen Wang, Dejia Xu, Zhiwen Fan, Suya You, Zhangyang Wang, Leonidas Guibas, Achuta Kadambi
cs.AI
Abstract
Recent advancements in 2D and multimodal models have achieved remarkable
success by leveraging large-scale training on extensive datasets. However,
extending these achievements to enable free-form interactions and high-level
semantic operations with complex 3D/4D scenes remains challenging. This
difficulty stems from the limited availability of large-scale, annotated 3D/4D
or multi-view datasets, which are crucial for generalizable vision and language
tasks such as open-vocabulary and prompt-based segmentation, language-guided
editing, and visual question answering (VQA). In this paper, we introduce
Feature4X, a universal framework designed to extend any functionality from 2D
vision foundation model into the 4D realm, using only monocular video input,
which is widely available from user-generated content. The "X" in Feature4X
represents its versatility, enabling any task through adaptable,
model-conditioned 4D feature field distillation. At the core of our framework
is a dynamic optimization strategy that unifies multiple model capabilities
into a single representation. Additionally, to the best of our knowledge,
Feature4X is the first method to distill and lift the features of video
foundation models (e.g. SAM2, InternVideo2) into an explicit 4D feature field
using Gaussian Splatting. Our experiments showcase novel view segment anything,
geometric and appearance scene editing, and free-form VQA across all time
steps, empowered by LLMs in feedback loops. These advancements broaden the
scope of agentic AI applications by providing a foundation for scalable,
contextually and spatiotemporally aware systems capable of immersive dynamic 4D
scene interaction.
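The core idea in the abstract, distilling 2D foundation-model features into an explicit 4D Gaussian feature field, can be illustrated with a minimal sketch: each Gaussian carries a learned latent feature, per-frame feature maps are composited from the Gaussians via rasterization weights, and the rendered maps are supervised against a frozen 2D teacher (e.g., SAM2 or InternVideo2 encoder outputs). The function names, shapes, and the L1 loss below are illustrative assumptions, not the paper's actual API; real rasterization weights come from Gaussian-splatting alpha compositing, which is faked here with random data.

```python
import numpy as np

rng = np.random.default_rng(0)

N_GAUSSIANS, LATENT_DIM, H, W = 64, 8, 16, 16

# Per-Gaussian latent features (optimized during training; random here).
latents = rng.normal(size=(N_GAUSSIANS, LATENT_DIM))

def render_features(latents, weights):
    """Composite per-Gaussian latents into an (H, W, D) feature map.

    weights: (H, W, N) per-pixel compositing weights that would come from
    Gaussian-splatting rasterization; each pixel's weights sum to 1 here.
    """
    return np.einsum('hwn,nd->hwd', weights, latents)

def distill_loss(rendered, target):
    """L1 distillation loss against a 2D foundation-model feature map."""
    return np.abs(rendered - target).mean()

# Fake rasterization weights and a fake teacher feature map
# (standing in for, e.g., SAM2 image-encoder features of one frame).
weights = rng.random(size=(H, W, N_GAUSSIANS))
weights /= weights.sum(axis=-1, keepdims=True)

teacher = rng.normal(size=(H, W, LATENT_DIM))

rendered = render_features(latents, weights)
loss = distill_loss(rendered, teacher)
print(rendered.shape, float(loss))
```

In the full method, this loss would be backpropagated to the per-Gaussian latents (and a decoder lifting them to the teacher's feature dimension) across all frames of the monocular video, yielding a single representation queryable for segmentation, editing, and VQA.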