

Seeing Voices: Generating A-Roll Video from Audio with Mirage

June 9, 2025
作者: Aditi Sundararaman, Amogh Adishesha, Andrew Jaegle, Dan Bigioi, Hyoung-Kyu Song, Jon Kyl, Justin Mao, Kevin Lan, Mojtaba Komeili, ShahRukh Athar, Sheila Babayan, Stanislau Beliasau, William Buchwalter
cs.AI

Abstract

From professional filmmaking to user-generated content, creators and consumers have long recognized that the power of video depends on the harmonious integration of what we hear (the video's audio track) with what we see (the video's image sequence). Current approaches to video generation either ignore sound to focus on general-purpose but silent image sequence generation or address both visual and audio elements but focus on restricted application domains such as re-dubbing. We introduce Mirage, an audio-to-video foundation model that excels at generating realistic, expressive output imagery from scratch given an audio input. When integrated with existing methods for speech synthesis (text-to-speech, or TTS), Mirage results in compelling multimodal video. When trained on audio-video footage of people talking (A-roll) and conditioned on audio containing speech, Mirage generates video of people delivering a believable interpretation of the performance implicit in input audio. Our central technical contribution is a unified method for training self-attention-based audio-to-video generation models, either from scratch or given existing weights. This methodology allows Mirage to retain generality as an approach to audio-to-video generation while producing outputs of superior subjective quality to methods that incorporate audio-specific architectures or loss components specific to people, speech, or details of how images or audio are captured. We encourage readers to watch and listen to the results of Mirage for themselves (see paper and comments for links).
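The abstract does not specify the architecture, but its stated contribution — a unified method for training self-attention-based audio-to-video models without audio-specific modules — suggests conditioning by letting video tokens attend to audio tokens within ordinary self-attention. A minimal NumPy sketch of that idea (all names, shapes, and the single-head formulation are illustrative assumptions, not Mirage's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(video_tokens, audio_tokens, Wq, Wk, Wv):
    """Single-head self-attention over the concatenation of audio and
    video tokens, so video tokens attend to audio context directly
    instead of through a separate audio-specific cross-attention path."""
    x = np.concatenate([audio_tokens, video_tokens], axis=0)  # (Ta+Tv, d)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    out = attn @ v
    # Return only the updated video-token stream.
    return out[audio_tokens.shape[0]:]

rng = np.random.default_rng(0)
d = 16
audio = rng.normal(size=(8, d))   # hypothetical tokens from a speech encoder
video = rng.normal(size=(32, d))  # hypothetical latent video tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
updated_video = joint_self_attention(video, audio, Wq, Wk, Wv)
print(updated_video.shape)  # (32, 16)
```

Because the audio enters through the same attention mechanism as the video tokens, the same training recipe can apply whether the weights are initialized from scratch or from an existing video model, consistent with the "from scratch or given existing weights" claim above.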
PDF · June 11, 2025