

Tora: Trajectory-oriented Diffusion Transformer for Video Generation

July 31, 2024
作者: Zhenghao Zhang, Junchao Liao, Menghao Li, Long Qin, Weizhi Wang
cs.AI

Abstract

Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that integrates textual, visual, and trajectory conditions concurrently for video generation. Specifically, Tora consists of a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D video compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos following trajectories. Our design aligns seamlessly with DiT's scalability, allowing precise control of video content's dynamics with diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate Tora's excellence in achieving high motion fidelity, while also meticulously simulating the movement of the physical world. Page can be found at https://ali-videoai.github.io/tora_video.
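The abstract describes a two-stage pipeline: the TE turns a trajectory into spacetime motion patches, and the MGF injects those patches into DiT blocks. The toy sketch below illustrates that data flow only; the function names, the temporal-pooling stand-in for the 3D video compression network, and the residual-addition fusion are all illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def extract_motion_patches(trajectory, t_down=4, channels=8, seed=0):
    # Hypothetical stand-in for Tora's Trajectory Extractor (TE):
    # turn per-frame (x, y) points into motion offsets, then
    # temporally downsample (mimicking 3D video compression) and
    # project into a patch embedding space.
    rng = np.random.default_rng(seed)
    traj = np.asarray(trajectory, dtype=np.float32)    # (T, 2)
    flow = np.diff(traj, axis=0, prepend=traj[:1])     # per-frame motion offsets
    T = flow.shape[0]
    pooled = flow[: T - T % t_down].reshape(-1, t_down, 2).mean(axis=1)
    W = rng.standard_normal((2, channels)).astype(np.float32)
    return pooled @ W                                  # (T // t_down, channels)

def motion_guidance_fuse(hidden, motion_patches, scale=0.1):
    # Hypothetical Motion-guidance Fuser (MGF): inject motion patches
    # into DiT block hidden states via a scaled residual addition.
    return hidden + scale * motion_patches

# Usage: a straight-line trajectory over 16 frames.
traj = [(2.0 * t, 1.0 * t) for t in range(16)]
patches = extract_motion_patches(traj)                 # shape (4, 8)
hidden = np.zeros_like(patches)                        # toy DiT hidden states
fused = motion_guidance_fuse(hidden, patches)
print(patches.shape)
```

In the real model the fusion happens inside every DiT block and the compression network is learned; this sketch only shows why encoding trajectories as spacetime patches lets them align with the DiT's patch-based representation.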

