BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way

Abstract

Text-to-video (T2V) generation models offer convenient visual creation and have recently attracted growing attention. Despite their substantial potential, the generated videos may exhibit artifacts, including structural implausibility, temporal inconsistency, and a lack of motion, often resulting in near-static videos. In this work, we identify a correlation between the disparity of temporal attention maps across different blocks and the occurrence of temporal inconsistencies. We also observe that the energy contained in the temporal attention maps is directly related to the magnitude of motion in the generated videos. Based on these observations, we present BroadWay, a training-free method that improves the quality of text-to-video generation without introducing additional parameters or increasing memory or sampling time. Specifically, BroadWay comprises two principal components: 1) Temporal Self-Guidance improves the structural plausibility and temporal consistency of generated videos by reducing the disparity between the temporal attention maps across various decoder blocks. 2) Fourier-based Motion Enhancement increases the magnitude and richness of motion by amplifying the energy of the temporal attention map. Extensive experiments demonstrate that BroadWay significantly improves the quality of text-to-video generation at negligible additional cost.
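The two components can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no formulas, so the blending rule, the choice to scale every non-DC frequency, and the names `temporal_self_guidance`, `fourier_motion_enhancement`, `alpha`, and `beta` are all assumptions made for illustration. The sketch also assumes that the two temporal attention maps already share the same frames-by-frames shape, which in a real model would likely require resizing across blocks.

```python
import numpy as np


def softmax(x, axis=-1):
    """Row-wise softmax, used here only to build toy attention maps."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_self_guidance(attn, anchor, alpha=0.5):
    """Assumed rule: move one block's map toward an anchor block's map.

    With 0 <= alpha <= 1 this is a convex combination, so rows that
    sum to 1 in both inputs still sum to 1 in the output.
    """
    return attn + alpha * (anchor - attn)


def fourier_motion_enhancement(attn, beta=1.5):
    """Assumed rule: amplify every non-DC frequency along the last axis.

    The DC term carries each row's sum, so leaving it untouched keeps
    the row sums unchanged while beta > 1 raises the map's energy.
    Individual entries may become negative for large beta.
    """
    spectrum = np.fft.fft(attn, axis=-1)
    scale = np.full(attn.shape[-1], float(beta))
    scale[0] = 1.0  # leave the DC component as it is
    return np.fft.ifft(spectrum * scale, axis=-1).real


# Toy example: two 8-frame temporal attention maps from different blocks.
rng = np.random.default_rng(0)
frames = 8
anchor_map = softmax(rng.normal(size=(frames, frames)))
block_map = softmax(rng.normal(size=(frames, frames)))

guided = temporal_self_guidance(block_map, anchor_map, alpha=0.5)
enhanced = fourier_motion_enhancement(guided, beta=1.5)

print("disparity before:", np.linalg.norm(block_map - anchor_map))
print("disparity after: ", np.linalg.norm(guided - anchor_map))
print("energy before:", np.sum(guided ** 2))
print("energy after: ", np.sum(enhanced ** 2))
```

Both functions act only on existing attention maps and add no learnable parameters, which is consistent with the training-free claim. The specific values of `alpha` and `beta` here are arbitrary.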