4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation
June 18, 2025
Authors: Chaoyang Wang, Ashkan Mirzaei, Vidit Goel, Willi Menapace, Aliaksandr Siarohin, Avalon Vinella, Michael Vasilkovsky, Ivan Skorokhodov, Vladislav Shakhrai, Sergey Korolev, Sergey Tulyakov, Peter Wonka
cs.AI
Abstract
We propose the first framework capable of computing a 4D spatio-temporal grid
of video frames and 3D Gaussian particles for each time step using a
feed-forward architecture. Our architecture has two main components: a 4D video
model and a 4D reconstruction model. In the first part, we analyze current 4D
video diffusion architectures that perform spatial and temporal attention
either sequentially or in parallel within a two-stream design. We highlight the
limitations of existing approaches and introduce a novel fused architecture
that performs spatial and temporal attention within a single layer. The key to
our method is a sparse attention pattern, where tokens attend to others in the
same frame, at the same timestamp, or from the same viewpoint. In the second
part, we extend existing 3D reconstruction algorithms by introducing a Gaussian
head, a camera token replacement algorithm, and additional dynamic layers and
training. Overall, we establish a new state of the art for 4D generation,
improving both visual quality and reconstruction capability.
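
To make the sparse attention pattern concrete, below is a minimal sketch (not the authors' code) that builds a boolean attention mask over a V x T grid of frames, where a token may attend to another token only if they share a timestamp or a viewpoint; sharing a frame is the special case of sharing both. The function name, grid sizes, and per-frame token count are illustrative assumptions.

# Minimal sketch of the fused view-time sparse attention mask described above.
# All names and sizes here are illustrative, not taken from the paper.
import torch

def fused_view_time_mask(num_views: int, num_times: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean mask of shape (L, L), L = V*T*N, True where attention is allowed."""
    # Assign each token its (view, time) index within the V x T frame grid.
    view_id = torch.arange(num_views).repeat_interleave(num_times * tokens_per_frame)
    time_id = torch.arange(num_times).repeat_interleave(tokens_per_frame).repeat(num_views)
    # Attention is allowed between tokens that share a timestamp (spatial/view
    # attention) or share a viewpoint (temporal attention).
    same_time = time_id[:, None] == time_id[None, :]
    same_view = view_id[:, None] == view_id[None, :]
    return same_time | same_view

# Example: 4 views x 8 time steps, 16 tokens per frame.
mask = fused_view_time_mask(num_views=4, num_times=8, tokens_per_frame=16)
print(mask.shape, mask.float().mean())  # fraction of token pairs allowed to attend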