Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation
July 8, 2025
作者: Zhenghao Zhang, Junchao Liao, Xiangyu Meng, Long Qin, Weizhi Wang
cs.AI
Abstract
Recent advances in diffusion transformer models for motion-guided video
generation, such as Tora, have shown significant progress. In this paper, we
present Tora2, an enhanced version of Tora, which introduces several design
improvements to expand its capabilities in both appearance and motion
customization. Specifically, we introduce a decoupled personalization extractor
that generates comprehensive personalization embeddings for multiple open-set
entities, better preserving fine-grained visual details compared to previous
methods. Building on this, we design a gated self-attention mechanism to
integrate trajectory, textual description, and visual information for each
entity. This innovation significantly reduces misalignment in multimodal
conditioning during training. Moreover, we introduce a contrastive loss that
jointly optimizes trajectory dynamics and entity consistency through explicit
mapping between motion and personalization embeddings. Tora2 is, to the best of
our knowledge, the first method to achieve simultaneous multi-entity customization
of appearance and motion for video generation. Experimental results demonstrate
that Tora2 achieves competitive performance with state-of-the-art customization
methods while providing advanced motion control capabilities, marking a
critical advance in multi-condition video generation. Project page:
https://github.com/alibaba/Tora.
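The abstract names two mechanisms: a gated self-attention layer that fuses each entity's trajectory, text, and visual tokens, and a contrastive loss aligning motion embeddings with personalization embeddings. The sketch below illustrates plausible forms of both, not Tora2's actual implementation: a tanh-gated attention branch (the zero-initialized gate lets the new conditioning blend in gradually during training) and an InfoNCE-style loss; all function names, shapes, and the temperature value are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_self_attention(tokens, Wq, Wk, Wv, gate):
    """Fuse per-entity trajectory/text/visual tokens via self-attention.
    `gate` is a learned scalar, typically initialized to 0 so the fused
    branch starts as an identity (a common gated-layer pattern; the exact
    Tora2 design is an assumption here)."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    # Residual connection: tanh(0) = 0 means the layer is a no-op at init.
    return tokens + np.tanh(gate) * (attn @ v)

def contrastive_loss(motion, ident, temp=0.07):
    """InfoNCE-style loss pulling each entity's motion embedding toward its
    own personalization embedding and away from other entities'
    (a hypothetical form of the paper's contrastive objective)."""
    m = motion / np.linalg.norm(motion, axis=1, keepdims=True)
    p = ident / np.linalg.norm(ident, axis=1, keepdims=True)
    logits = m @ p.T / temp                     # pairwise cosine similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(m))
    return -log_probs[idx, idx].mean()          # matched pairs are positives
```

With the gate at zero the attention branch contributes nothing, so pretrained behavior is preserved until training opens the gate; the contrastive term is minimized when each motion embedding is closest to its own entity's personalization embedding.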