
Tora: Trajectory-oriented Diffusion Transformer for Video Generation

July 31, 2024
作者: Zhenghao Zhang, Junchao Liao, Menghao Li, Long Qin, Weizhi Wang
cs.AI

Abstract

Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that concurrently integrates textual, visual, and trajectory conditions for video generation. Specifically, Tora consists of a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D video compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos that follow the trajectories. Our design aligns seamlessly with DiT's scalability, allowing precise control of video dynamics across diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate Tora's excellence in achieving high motion fidelity while also meticulously simulating the movement of the physical world. Project page: https://ali-videoai.github.io/tora_video.
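The abstract describes a two-stage conditioning pipeline: the TE compresses a trajectory into hierarchical spacetime motion patches, and the MGF injects those patches into the DiT blocks. The PyTorch sketch below illustrates that data flow only; the class names `TrajectoryExtractor` and `MotionGuidanceFuser` follow the paper's terminology, but every shape, layer choice, and the gated-residual fusion are assumptions made for illustration, not the authors' released implementation.

```python
# Minimal sketch of the Tora conditioning flow described in the abstract.
# All architectural details below are illustrative assumptions.
import torch
import torch.nn as nn


class TrajectoryExtractor(nn.Module):
    """Encodes a dense trajectory (displacement field) into spacetime
    motion patches via a small 3D compression network (assumed design)."""

    def __init__(self, in_ch: int = 2, dim: int = 256):
        super().__init__()
        self.compress = nn.Sequential(
            nn.Conv3d(in_ch, dim // 2, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv3d(dim // 2, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        # traj: (B, 2, T, H, W) per-pixel (dx, dy) displacements
        feat = self.compress(traj)              # (B, dim, T', H', W')
        return feat.flatten(2).transpose(1, 2)  # (B, T'*H'*W', dim) motion patches


class MotionGuidanceFuser(nn.Module):
    """Fuses motion patches into a DiT block's token stream. Modeled here
    as a zero-initialized gated residual so the block starts as identity
    (a common conditioning trick; the abstract does not specify Tora's
    actual fusion mechanism)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, video_tokens: torch.Tensor,
                motion_patches: torch.Tensor) -> torch.Tensor:
        # Assumes motion patches are already aligned one-to-one with the
        # video latent tokens (same sequence length and channel width).
        return video_tokens + self.gate * self.proj(motion_patches)


if __name__ == "__main__":
    te, mgf = TrajectoryExtractor(), MotionGuidanceFuser()
    traj = torch.randn(1, 2, 8, 32, 32)   # toy displacement field
    tokens = torch.randn(1, 128, 256)     # matching video latent tokens (2*8*8)
    fused = mgf(tokens, te(traj))
    print(fused.shape)                    # torch.Size([1, 128, 256])
```

Per the abstract, the MGF integrates motion patches into the DiT blocks, so a fuser of this kind would sit inside each Spatial-Temporal DiT block, steering denoising at every depth rather than only at the input.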
