

Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer

May 27, 2024
作者: Ruizhi Shao, Youxin Pang, Zerong Zheng, Jingxiang Sun, Yebin Liu
cs.AI

Abstract

We present a novel approach for generating high-quality, spatio-temporally coherent human videos from a single image under arbitrary viewpoints. Our framework combines the strengths of U-Nets for accurate condition injection and diffusion transformers for capturing global correlations across viewpoints and time. The core is a cascaded 4D transformer architecture that factorizes attention across view, time, and spatial dimensions, enabling efficient modeling of the 4D space. Precise conditioning is achieved by injecting human identity, camera parameters, and temporal signals into the respective transformers. To train this model, we curate a multi-dimensional dataset spanning images, videos, multi-view data, and 3D/4D scans, along with a multi-dimensional training strategy. Our approach overcomes the limitations of previous methods based on GANs or U-Net-based diffusion models, which struggle with complex motions and viewpoint changes. Through extensive experiments, we demonstrate our method's ability to synthesize realistic, coherent, and free-view human videos, paving the way for advanced multimedia applications in areas such as virtual reality and animation. Our project website is https://human4dit.github.io.
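To illustrate what attention factorized across view, time, and spatial dimensions can look like, below is a minimal PyTorch sketch. It is not the authors' implementation; the module structure, tensor layout (batch, views, frames, tokens, channels), and interleaving order are assumptions made purely for illustration. The point is the cost saving: three attention passes over short axes instead of one joint pass over all views × frames × tokens.

```python
# Minimal sketch of factorized 4D attention (illustrative only, not Human4DiT code).
import torch
import torch.nn as nn


class Factorized4DAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Separate attention blocks for each axis (hypothetical naming).
        self.space_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.view_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, V, T, N, C) = (batch, views, frames, spatial tokens, channels)
        B, V, T, N, C = x.shape

        # Spatial attention: tokens within each (view, frame) attend to each other.
        h = x.reshape(B * V * T, N, C)
        h = h + self.space_attn(h, h, h, need_weights=False)[0]

        # Temporal attention: each spatial token attends across frames.
        h = h.reshape(B, V, T, N, C).permute(0, 1, 3, 2, 4).reshape(B * V * N, T, C)
        h = h + self.time_attn(h, h, h, need_weights=False)[0]

        # View attention: each spatio-temporal token attends across camera views.
        h = h.reshape(B, V, N, T, C).permute(0, 3, 2, 1, 4).reshape(B * T * N, V, C)
        h = h + self.view_attn(h, h, h, need_weights=False)[0]

        # Restore the original (B, V, T, N, C) layout.
        return h.reshape(B, T, N, V, C).permute(0, 3, 1, 2, 4)


if __name__ == "__main__":
    x = torch.randn(1, 4, 8, 64, 128)  # 4 views, 8 frames, 64 tokens, 128 channels
    print(Factorized4DAttention(128)(x).shape)  # torch.Size([1, 4, 8, 64, 128])
```

In this sketch, conditioning signals (identity, camera parameters, time) would be injected into the corresponding per-axis blocks, as described in the abstract; how that injection is done in the actual model is specified in the paper, not here.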