Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation
July 8, 2025
作者: Zhenghao Zhang, Junchao Liao, Xiangyu Meng, Long Qin, Weizhi Wang
cs.AI
Abstract
Recent advances in diffusion transformer models for motion-guided video
generation, such as Tora, have shown significant progress. In this paper, we
present Tora2, an enhanced version of Tora, which introduces several design
improvements to expand its capabilities in both appearance and motion
customization. Specifically, we introduce a decoupled personalization extractor
that generates comprehensive personalization embeddings for multiple open-set
entities, better preserving fine-grained visual details compared to previous
methods. Building on this, we design a gated self-attention mechanism to
integrate trajectory, textual description, and visual information for each
entity. This innovation significantly reduces misalignment in multimodal
conditioning during training. Moreover, we introduce a contrastive loss that
jointly optimizes trajectory dynamics and entity consistency through explicit
mapping between motion and personalization embeddings. Tora2 is, to the best of
our knowledge, the first method to achieve simultaneous multi-entity customization
of appearance and motion for video generation. Experimental results demonstrate
that Tora2 achieves competitive performance with state-of-the-art customization
methods while providing advanced motion control capabilities, marking a
critical advance in multi-condition video generation. Project page:
https://github.com/alibaba/Tora.
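The abstract names two mechanisms: a gated self-attention layer that fuses each entity's trajectory, text, and visual tokens, and a contrastive loss aligning motion embeddings with personalization embeddings. The sketch below illustrates plausible forms of both, not Tora2's actual implementation: a tanh-gated attention branch (the zero-initialized gate lets the new conditioning blend in gradually during training) and an InfoNCE-style loss; all function names, shapes, and the temperature value are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_self_attention(tokens, Wq, Wk, Wv, gate):
    """Fuse per-entity trajectory/text/visual tokens via self-attention.
    `gate` is a learned scalar, typically initialized to 0 so the fused
    branch starts as an identity (a common gated-layer pattern; the exact
    Tora2 design is an assumption here)."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    # Residual connection: tanh(0) = 0 means the layer is a no-op at init.
    return tokens + np.tanh(gate) * (attn @ v)

def contrastive_loss(motion, ident, temp=0.07):
    """InfoNCE-style loss pulling each entity's motion embedding toward its
    own personalization embedding and away from other entities'
    (a hypothetical form of the paper's contrastive objective)."""
    m = motion / np.linalg.norm(motion, axis=1, keepdims=True)
    p = ident / np.linalg.norm(ident, axis=1, keepdims=True)
    logits = m @ p.T / temp                     # pairwise cosine similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(m))
    return -log_probs[idx, idx].mean()          # matched pairs are positives
```

With the gate at zero the attention branch contributes nothing, so pretrained behavior is preserved until training opens the gate; the contrastive term is minimized when each motion embedding is closest to its own entity's personalization embedding.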