DomainShuttle: Freeform Open-Domain Subject-Driven Text-to-Video Generation

Abstract

Open-domain subject-driven text-to-video (S2V) generation has drawn significant interest in academia and industry. Open-domain S2V mainly involves two scenarios: in-domain, which requires retaining the reference subject's features as faithfully as possible, and cross-domain, which preserves the subject's intrinsic features while allowing subject-irrelevant properties to vary flexibly according to the text prompt. Existing methods primarily focus on maximizing subject fidelity in in-domain scenarios, which limits their editability and adaptability in cross-domain scenarios such as novel styles, semantic combinations, or domain attributes. We argue that an ideal S2V method should shuttle flexibly between domains, achieving strong performance in both in-domain and cross-domain scenarios. To this end, we propose DomainShuttle, which achieves both high fidelity and generative flexibility for open-domain video personalization. Specifically, we introduce Domain-MoT, which decouples video and reference features and uses a domain-aware AdaLN for domain-specific modeling of reference images. We then introduce the Video-Reference DualRoPE scheme, which places reference image tokens and video tokens in separate RoPE spaces to enable precise subject-level spatial modeling, and the Cross-Pair Consistent Loss, which aims to extract intrinsic subject features that are unaffected by irrelevant features. Extensive experiments demonstrate that DomainShuttle achieves significant performance improvements over existing methods, exhibiting high subject fidelity and generative flexibility across diverse open-domain application scenarios.
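The abstract does not specify how the Video-Reference DualRoPE scheme assigns positions, so the following is only a minimal sketch of one possible reading: video tokens and reference image tokens draw their rotary position indices from disjoint ranges. The function names `rope_rotate` and `dual_positions`, the `ref_offset` parameter, and the 1D simplification (video models typically use multi-axis RoPE) are all assumptions made for illustration, not the authors' implementation.

```python
import math


def rope_rotate(vec, pos, base=10000.0):
    """Apply standard 1D rotary position embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Pair (i, i+1) rotates by angle pos * base^(-i/dim).
        theta = pos * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dual_positions(num_video, num_ref, ref_offset):
    """Hypothetical index assignment (an assumption, not from the paper):
    video tokens use [0, num_video); reference tokens use a disjoint
    range starting at ref_offset."""
    if ref_offset < num_video:
        raise ValueError("ref_offset must not overlap the video range")
    video_pos = list(range(num_video))
    ref_pos = [ref_offset + j for j in range(num_ref)]
    return video_pos, ref_pos


# Example: 4 video tokens and 2 reference tokens in separate index ranges.
video_pos, ref_pos = dual_positions(num_video=4, num_ref=2, ref_offset=1000)
token = [1.0, 2.0, 3.0, 4.0]
video_keys = [rope_rotate(token, p) for p in video_pos]
ref_keys = [rope_rotate(token, p) for p in ref_pos]
```

Because the two index ranges never overlap in this sketch, a reference token is never assigned the same rotary position as a video token. How the actual method separates the two spaces may differ.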