VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing

Abstract

Recent advances in diffusion models have significantly improved video generation and editing capabilities. However, multi-grained video editing, which encompasses class-level, instance-level, and part-level modifications, remains a formidable challenge. The major difficulties in multi-grained editing are semantic misalignment in text-to-region control and feature coupling within the diffusion model. To address these difficulties, we present VideoGrain, a zero-shot approach that modulates space-time (cross- and self-) attention mechanisms to achieve fine-grained control over video content. In cross-attention, we enhance text-to-region control by amplifying each local prompt's attention to its corresponding spatially disentangled region while minimizing interactions with irrelevant areas. In self-attention, we improve feature separation by increasing intra-region awareness and reducing inter-region interference. Extensive experiments demonstrate that our method achieves state-of-the-art performance in real-world scenarios. Our code, data, and demos are available at https://knightyxp.github.io/VideoGrain_project_page/
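
To make the idea of attention modulation concrete, the following is a minimal sketch in plain Python and not the authors' implementation. It assumes each pixel (flattened across space and, if desired, frames) carries a region id, and each text token is either shared or tied to one region through its local prompt. The function names, the additive bias form, and the `strength` parameter are hypothetical choices made for illustration; the abstract only states that attention to the matching region is amplified and interaction with other regions is reduced.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def modulate_cross_attention(logits, pixel_region, token_region, strength=5.0):
    """Bias pixel-to-token logits, then normalize each row.

    logits[i][j]    : raw score between pixel i and text token j
    pixel_region[i] : region id of pixel i (assumed given, e.g. from masks)
    token_region[j] : region id targeted by token j, or None for shared tokens
    strength        : hypothetical bias magnitude (an assumption of this sketch)
    """
    out = []
    for i, row in enumerate(logits):
        biased = []
        for j, score in enumerate(row):
            target = token_region[j]
            if target is None:
                bias = 0.0          # shared token: leave untouched
            elif target == pixel_region[i]:
                bias = strength     # amplify prompt-to-own-region attention
            else:
                bias = -strength    # suppress attention to other regions
            biased.append(score + bias)
        out.append(softmax(biased))
    return out


def modulate_self_attention(logits, pixel_region, strength=5.0):
    """Bias pixel-to-pixel logits: raise same-region pairs, lower cross-region pairs."""
    out = []
    for i, row in enumerate(logits):
        biased = [
            score + (strength if pixel_region[i] == pixel_region[j] else -strength)
            for j, score in enumerate(row)
        ]
        out.append(softmax(biased))
    return out


# Toy example: four pixels in two regions, three tokens (one shared, one per region).
pixel_region = [0, 0, 1, 1]
token_region = [None, 0, 1]
cross_attn = modulate_cross_attention(
    [[0.0] * 3 for _ in range(4)], pixel_region, token_region
)
self_attn = modulate_self_attention([[0.0] * 4 for _ in range(4)], pixel_region)
```

With uniform input logits, each pixel in the toy example ends up attending mostly to the token of its own region and to pixels in the same region, which is the qualitative behavior the abstract describes. A real diffusion model would apply such biases inside its attention layers over many heads and frames.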
