Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation

Abstract

We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos must be aligned with the input audio both globally and temporally: globally, the input audio is semantically associated with the entire output video; temporally, each segment of the input audio is associated with a corresponding segment of that video. We build on an existing text-conditioned video generation model and a pre-trained audio encoder. The proposed method is based on a lightweight adaptor network that learns to map the audio-based representation to the input representation expected by the text-to-video generation model. As such, it enables video generation conditioned on text, on audio, and, for the first time as far as we can ascertain, on both text and audio. We validate our method extensively on three datasets that exhibit significant semantic diversity of audio-video samples, and we propose a novel evaluation metric (AV-Align) to assess the alignment of generated videos with input audio samples. AV-Align is based on the detection and comparison of energy peaks in both modalities. Compared with recent state-of-the-art approaches, our method generates videos that are better aligned with the input sound, with respect to both content and the temporal axis. We also show that videos produced by our method have higher visual quality and are more diverse.
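
The abstract describes AV-Align only as detecting and comparing energy peaks in both modalities. The following is a minimal sketch of that idea and not the authors' implementation. It assumes per-frame energy curves are already available (for example, audio loudness and video motion magnitude sampled at the same frame rate). The function names `local_peaks` and `peak_alignment_score`, the mean-energy threshold, the `tolerance` window, and the symmetric hit ratio are all hypothetical choices made for illustration.

```python
import numpy as np


def local_peaks(energy, times):
    """Return the timestamps of strict local maxima that exceed the mean energy.

    Assumption: a "peak" is a sample larger than both neighbours and larger
    than the mean of the curve. The paper may define peaks differently.
    """
    energy = np.asarray(energy, dtype=float)
    times = np.asarray(times, dtype=float)
    if energy.size < 3:
        return np.array([])
    mid = energy[1:-1]
    is_peak = (mid > energy[:-2]) & (mid > energy[2:]) & (mid > energy.mean())
    return times[1:-1][is_peak]


def peak_alignment_score(audio_peaks, video_peaks, tolerance):
    """Fraction of peaks that have a counterpart in the other modality.

    Assumption: a peak counts as matched if the other modality has a peak
    within `tolerance` time units. The score is symmetric and lies in [0, 1].
    """
    audio_peaks = np.asarray(audio_peaks, dtype=float)
    video_peaks = np.asarray(video_peaks, dtype=float)
    if audio_peaks.size == 0 or video_peaks.size == 0:
        return 0.0
    # Pairwise absolute time gaps, shape (n_audio, n_video).
    gaps = np.abs(audio_peaks[:, None] - video_peaks[None, :])
    audio_hits = int((gaps.min(axis=1) <= tolerance).sum())
    video_hits = int((gaps.min(axis=0) <= tolerance).sum())
    return (audio_hits + video_hits) / (audio_peaks.size + video_peaks.size)


# Toy curves indexed by frame number (hypothetical data, not from the paper).
frames = np.arange(9)
audio_energy = [0, 1, 0, 0, 5, 0, 0, 1, 0]
video_motion = [0, 0, 3, 0, 4, 0, 0, 0, 0]

audio_peaks = local_peaks(audio_energy, frames)
video_peaks = local_peaks(video_motion, frames)
score = peak_alignment_score(audio_peaks, video_peaks, tolerance=1)
```

In this toy example, a larger `tolerance` forgives small offsets between a sound and the motion it accompanies, while a tolerance of zero only rewards peaks that fall on the same frame.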
