Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation
July 8, 2025
Authors: Zhenghao Zhang, Junchao Liao, Xiangyu Meng, Long Qin, Weizhi Wang
cs.AI
Abstract
Diffusion transformer models for motion-guided video generation, such as Tora,
have recently made significant progress. In this paper, we
present Tora2, an enhanced version of Tora, which introduces several design
improvements to expand its capabilities in both appearance and motion
customization. Specifically, we introduce a decoupled personalization extractor
that generates comprehensive personalization embeddings for multiple open-set
entities, better preserving fine-grained visual details compared to previous
methods. Building on this, we design a gated self-attention mechanism to
integrate trajectory, textual description, and visual information for each
entity. This innovation significantly reduces misalignment in multimodal
conditioning during training. Moreover, we introduce a contrastive loss that
jointly optimizes trajectory dynamics and entity consistency through explicit
mapping between motion and personalization embeddings. To the best of our
knowledge, Tora2 is the first method to achieve simultaneous multi-entity customization
of appearance and motion for video generation. Experimental results demonstrate
that Tora2 performs competitively with state-of-the-art customization
methods while providing advanced motion control capabilities, which marks a
critical advancement in multi-condition video generation. Project page:
https://github.com/alibaba/Tora
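
The abstract does not spell out the form of the contrastive loss, so the following is only a rough illustration of the general idea of pulling each entity's motion embedding toward its own personalization embedding and away from those of other entities. It uses a generic InfoNCE-style objective, which is an assumption and not necessarily the loss used in Tora2. The function names, the `temperature` value, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length (assumes v is not the zero vector).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(motion_embs, identity_embs, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's formula).

    Entity i's motion embedding is treated as a positive pair with entity i's
    personalization embedding; all other entities act as negatives.
    """
    m = [l2_normalize(v) for v in motion_embs]
    p = [l2_normalize(v) for v in identity_embs]
    total = 0.0
    for i, mi in enumerate(m):
        # Cosine similarities between motion i and every identity embedding.
        logits = [dot(mi, pj) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(z - max_logit) for z in logits)
        )
        # Negative log-probability of picking the matching identity.
        total += log_denominator - logits[i]
    return total / len(m)


# Toy two-entity example: matched pairs versus deliberately swapped pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In this toy setup the loss is small when each motion embedding lines up with its own entity and large when the pairing is swapped, which is the behaviour an explicit motion-to-identity mapping would be expected to reward.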