Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields

Abstract

Recent 2D and multimodal models have achieved remarkable success through large-scale training on extensive datasets. However, extending these achievements to enable free-form interaction and high-level semantic operations with complex 3D/4D scenes remains challenging. This difficulty stems from the limited availability of large-scale, annotated 3D/4D or multi-view datasets, which are crucial for generalizable vision and language tasks such as open-vocabulary and prompt-based segmentation, language-guided editing, and visual question answering (VQA). In this paper, we introduce Feature4X, a universal framework designed to extend the functionality of any 2D vision foundation model into the 4D realm using only monocular video input, which is widely available from user-generated content. The "X" in Feature4X represents its versatility: it enables any task through adaptable, model-conditioned 4D feature field distillation. At the core of our framework is a dynamic optimization strategy that unifies multiple model capabilities into a single representation. Additionally, to the best of our knowledge, Feature4X is the first method to distill and lift the features of video foundation models (e.g., SAM2, InternVideo2) into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel-view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps, empowered by LLMs in feedback loops. These advancements broaden the scope of agentic AI applications by providing a foundation for scalable, contextually and spatiotemporally aware systems capable of immersive, dynamic 4D scene interaction.