DomainShuttle: Free Open-Domain Subject-Driven Text-to-Video Generation

Abstract

Open-domain subject-driven text-to-video (S2V) generation has drawn significant interest in both academia and industry. Open-domain S2V mainly involves two scenarios: the in-domain scenario, which requires retaining the features of the reference subject as faithfully as possible, and the cross-domain scenario, which requires preserving the subject's intrinsic features while allowing subject-irrelevant properties to vary flexibly according to the text prompt. Existing methods primarily focus on maximizing subject fidelity in in-domain scenarios, which limits their editability and adaptability in cross-domain scenarios such as novel styles, semantic combinations, or domain attributes. In this study, we argue that an ideal S2V method should shuttle flexibly between domains, achieving strong performance in both in-domain and cross-domain scenarios. To this end, we propose DomainShuttle, which achieves both high fidelity and generative flexibility for open-domain video personalization. Specifically, we introduce Domain-MoT, which decouples video and reference features and uses a domain-aware AdaLN for domain-specific modeling of reference images. We further introduce the Video-Reference DualRoPE scheme, which places reference image tokens and video tokens in separate RoPE spaces to enable precise subject-level spatial modeling, and the Cross-Pair Consistent Loss, which aims to extract intrinsic subject features that are unaffected by irrelevant features. Extensive experiments demonstrate that DomainShuttle achieves significant performance improvements over existing methods, exhibiting high subject fidelity and generative flexibility across diverse open-domain application scenarios.