Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer
May 27, 2024
Authors: Ruizhi Shao, Youxin Pang, Zerong Zheng, Jingxiang Sun, Yebin Liu
cs.AI
Abstract
We present a novel approach for generating high-quality, spatio-temporally
coherent human videos from a single image under arbitrary viewpoints. Our
framework combines the strengths of U-Nets for accurate condition injection and
diffusion transformers for capturing global correlations across viewpoints and
time. The core is a cascaded 4D transformer architecture that factorizes
attention across views, time, and spatial dimensions, enabling efficient
modeling of the 4D space. Precise conditioning is achieved by injecting human
identity, camera parameters, and temporal signals into the respective
transformers. To train this model, we curate a multi-dimensional dataset
spanning images, videos, multi-view data and 3D/4D scans, along with a
multi-dimensional training strategy. Our approach overcomes the limitations of
previous methods based on GANs or U-Net-based diffusion models, which struggle
with complex motions and viewpoint changes. Through extensive experiments, we
demonstrate our method's ability to synthesize realistic, coherent and
free-view human videos, paving the way for advanced multimedia applications in
areas such as virtual reality and animation. Our project website is
https://human4dit.github.io.
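
For intuition, below is a minimal PyTorch sketch (not the authors' implementation) of attention factorized across the view, time, and spatial axes described above. The token layout (batch, views, frames, spatial tokens, channels), the module and argument names, and the pre-norm/residual placement are assumptions for illustration; the injection of human identity, camera parameters, and temporal signals is omitted. Attending along one axis at a time keeps attention cost proportional to each axis length rather than to the product of views x frames x spatial tokens, which is what makes modeling the full 4D space tractable.

# Minimal sketch of attention factorized across view, time, and spatial axes.
# All names and the (batch, views, frames, spatial_tokens, dim) token layout
# are assumptions for illustration, not the paper's actual implementation.
import torch
import torch.nn as nn


class FactorizedAttentionBlock(nn.Module):
    """Attends along one axis at a time (views, then frames, then spatial
    tokens), so attention cost scales with each axis length instead of the
    product of all three."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.view_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(3)])

    @staticmethod
    def _attend_along(attn, norm, x, axis):
        # x: (B, V, T, S, C). Move the attended axis next to the channel dim,
        # fold the remaining axes into the batch dim, attend, then restore.
        b, v, t, s, c = x.shape
        x_moved = x.movedim(axis, 3)                    # (B, a1, a2, L, C)
        lead = x_moved.shape[:3]
        tokens = x_moved.reshape(-1, x_moved.shape[3], c)
        h = norm(tokens)
        h, _ = attn(h, h, h, need_weights=False)
        tokens = tokens + h                             # residual connection
        return tokens.reshape(*lead, x_moved.shape[3], c).movedim(3, axis)

    def forward(self, x):
        # x: (batch, views, frames, spatial_tokens, dim)
        x = self._attend_along(self.view_attn, self.norms[0], x, axis=1)
        x = self._attend_along(self.time_attn, self.norms[1], x, axis=2)
        x = self._attend_along(self.space_attn, self.norms[2], x, axis=3)
        return x


if __name__ == "__main__":
    block = FactorizedAttentionBlock(dim=64, num_heads=4)
    tokens = torch.randn(2, 4, 8, 16, 64)  # 2 clips, 4 views, 8 frames, 16 spatial tokens
    print(block(tokens).shape)             # torch.Size([2, 4, 8, 16, 64])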