

Seeing Voices: Generating A-Roll Video from Audio with Mirage

June 9, 2025
作者: Aditi Sundararaman, Amogh Adishesha, Andrew Jaegle, Dan Bigioi, Hyoung-Kyu Song, Jon Kyl, Justin Mao, Kevin Lan, Mojtaba Komeili, ShahRukh Athar, Sheila Babayan, Stanislau Beliasau, William Buchwalter
cs.AI

Abstract

From professional filmmaking to user-generated content, creators and consumers have long recognized that the power of video depends on the harmonious integration of what we hear (the video's audio track) with what we see (the video's image sequence). Current approaches to video generation either ignore sound to focus on general-purpose but silent image sequence generation or address both visual and audio elements but focus on restricted application domains such as re-dubbing. We introduce Mirage, an audio-to-video foundation model that excels at generating realistic, expressive output imagery from scratch given an audio input. When integrated with existing methods for speech synthesis (text-to-speech, or TTS), Mirage results in compelling multimodal video. When trained on audio-video footage of people talking (A-roll) and conditioned on audio containing speech, Mirage generates video of people delivering a believable interpretation of the performance implicit in input audio. Our central technical contribution is a unified method for training self-attention-based audio-to-video generation models, either from scratch or given existing weights. This methodology allows Mirage to retain generality as an approach to audio-to-video generation while producing outputs of superior subjective quality to methods that incorporate audio-specific architectures or loss components specific to people, speech, or details of how images or audio are captured. We encourage readers to watch and listen to the results of Mirage for themselves (see paper and comments for links).
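The abstract's central claim is a unified, self-attention-based approach: rather than adding audio-specific architectural components, audio and video tokens can share one attention mechanism. As a minimal illustrative sketch of that idea (not the paper's actual implementation — all names, shapes, and the single-head formulation here are assumptions), video tokens can be conditioned on audio by running plain self-attention over the concatenated token sequence:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(video_tokens, audio_tokens, Wq, Wk, Wv):
    """Single-head self-attention over the concatenated video+audio sequence.

    Because both modalities live in one sequence, every video token can
    attend to every audio token without any audio-specific module.
    """
    x = np.concatenate([video_tokens, audio_tokens], axis=0)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    out = scores @ v
    # Keep only the updated video tokens; audio acts purely as conditioning.
    return out[: len(video_tokens)]

rng = np.random.default_rng(0)
d = 16  # hypothetical token dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
video = rng.standard_normal((10, d))  # e.g. patch tokens of video frames
audio = rng.standard_normal((6, d))   # e.g. speech feature tokens
out = joint_self_attention(video, audio, Wq, Wk, Wv)
print(out.shape)  # (10, 16)
```

One appeal of this formulation, consistent with the abstract's generality claim, is that the same block works whether trained from scratch or initialized from existing (video-only) self-attention weights, since the audio tokens simply extend the attended sequence.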