4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation

Abstract

We propose the first framework capable of computing a 4D spatio-temporal grid of video frames and 3D Gaussian particles for each time step using a feed-forward architecture. Our architecture has two main components: a 4D video model and a 4D reconstruction model. In the first part, we analyze current 4D video diffusion architectures, which perform spatial and temporal attention either sequentially or in parallel within a two-stream design. We highlight the limitations of these approaches and introduce a novel fused architecture that performs spatial and temporal attention within a single layer. The key to our method is a sparse attention pattern in which each token attends to other tokens in the same frame, at the same timestamp, or from the same viewpoint. In the second part, we extend existing 3D reconstruction algorithms by introducing a Gaussian head, a camera token replacement algorithm, and additional dynamic layers and training. Overall, we establish a new state of the art for 4D generation, improving both visual quality and reconstruction capability.
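The sparse attention pattern described in the abstract can be illustrated with a small mask-building sketch. This is not the authors' implementation: the function names, the view-major token layout, and the fixed number of tokens per frame are assumptions made only for illustration. A token in the frame at (view v, time t) is allowed to attend to every token that shares its viewpoint or its timestamp; the same-frame case is the overlap of the two.

```python
from itertools import product


def fused_view_time_mask(num_views, num_times, tokens_per_frame):
    """Return mask[i][j], True when token i may attend to token j.

    Hypothetical layout: tokens are flattened view-major, then by time,
    then by position inside the frame.
    """
    # (view, time) coordinate of every token in the flattened sequence.
    coords = [
        (v, t)
        for v, t in product(range(num_views), range(num_times))
        for _ in range(tokens_per_frame)
    ]
    # Same viewpoint OR same timestamp. Tokens of the same frame satisfy
    # both conditions, so they are covered without a separate term.
    return [
        [vi == vj or ti == tj for (vj, tj) in coords]
        for (vi, ti) in coords
    ]


def density(mask):
    """Fraction of query-key pairs that the mask keeps."""
    n = len(mask)
    return sum(sum(row) for row in mask) / (n * n)


if __name__ == "__main__":
    mask = fused_view_time_mask(num_views=4, num_times=4, tokens_per_frame=2)
    print(len(mask), density(mask))
```

Under this layout each token sees V + T - 1 of the V * T frames, so the kept fraction is (V + T - 1) / (V * T), which shrinks as the grid of views and timestamps grows. In practice such a mask would be passed to a masked or block-sparse attention kernel rather than materialized as nested lists.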
