4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation

June 18, 2025
Authors: Chaoyang Wang, Ashkan Mirzaei, Vidit Goel, Willi Menapace, Aliaksandr Siarohin, Avalon Vinella, Michael Vasilkovsky, Ivan Skorokhodov, Vladislav Shakhrai, Sergey Korolev, Sergey Tulyakov, Peter Wonka
cs.AI

Abstract

We propose the first framework capable of computing a 4D spatio-temporal grid of video frames and 3D Gaussian particles for each time step using a feed-forward architecture. Our architecture has two main components, a 4D video model and a 4D reconstruction model. In the first part, we analyze current 4D video diffusion architectures that perform spatial and temporal attention either sequentially or in parallel within a two-stream design. We highlight the limitations of existing approaches and introduce a novel fused architecture that performs spatial and temporal attention within a single layer. The key to our method is a sparse attention pattern, where tokens attend to others in the same frame, at the same timestamp, or from the same viewpoint. In the second part, we extend existing 3D reconstruction algorithms by introducing a Gaussian head, a camera token replacement algorithm, and additional dynamic layers and training. Overall, we establish a new state of the art for 4D generation, improving both visual quality and reconstruction capability.
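The sparse attention pattern described above reduces to a simple boolean mask over the flattened 4D token grid. Below is a minimal sketch in PyTorch, assuming a flattened (view, time, spatial) token layout; the function name, shapes, and use of `scaled_dot_product_attention` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def fused_view_time_mask(num_views: int, num_times: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean attention mask for the fused view-time pattern: a token may
    attend to any token sharing its timestamp (attention across views) or
    its viewpoint (attention across time). Tokens in the same frame share
    both, so intra-frame spatial attention is covered as well.
    Layout assumption: tokens are flattened as (view, time, spatial)."""
    # View and time index of every token in the flattened grid.
    v = torch.arange(num_views).repeat_interleave(num_times * tokens_per_frame)
    t = torch.arange(num_times).repeat_interleave(tokens_per_frame).repeat(num_views)
    # Allow attention iff the two tokens share a timestamp or a viewpoint.
    return (t[:, None] == t[None, :]) | (v[:, None] == v[None, :])

# Usage with PyTorch's built-in attention; True entries permit attention.
# All sizes here are made up for illustration.
q = k = v_feat = torch.randn(1, 8, 4 * 5 * 16, 64)  # (batch, heads, tokens, dim)
mask = fused_view_time_mask(num_views=4, num_times=5, tokens_per_frame=16)
out = F.scaled_dot_product_attention(q, k, v_feat, attn_mask=mask)
```

To see the sparsity this buys: with 4 views, 5 timesteps, and 16 spatial tokens per frame (320 tokens total), each token attends to 4·16 + 5·16 − 16 = 128 tokens rather than all 320, which is what makes fusing spatial and temporal attention into a single layer tractable.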