

Learning to Highlight Audio by Watching Movies

May 17, 2025
作者: Chao Huang, Ruohan Gao, J. M. F. Tsang, Jan Kurcius, Cagdas Bilen, Chenliang Xu, Anurag Kumar, Sanjeel Parekh
cs.AI

Abstract

Recent years have seen a significant increase in video content creation and consumption. Crafting engaging content requires the careful curation of both visual and audio elements. While visual cue curation, through techniques like optimal viewpoint selection or post-editing, has been central to media production, its natural counterpart, audio, has not undergone equivalent advancements. This often results in a disconnect between visual and acoustic saliency. To bridge this gap, we introduce a novel task: visually-guided acoustic highlighting, which aims to transform audio to deliver appropriate highlighting effects guided by the accompanying video, ultimately creating a more harmonious audio-visual experience. We propose a flexible, transformer-based multimodal framework to solve this task. To train our model, we also introduce a new dataset -- the muddy mix dataset, leveraging the meticulous audio and video crafting found in movies, which provides a form of free supervision. We develop a pseudo-data generation process to simulate poorly mixed audio, mimicking real-world scenarios through a three-step process -- separation, adjustment, and remixing. Our approach consistently outperforms several baselines in both quantitative and subjective evaluation. We also systematically study the impact of different types of contextual guidance and difficulty levels of the dataset. Our project page is here: https://wikichao.github.io/VisAH/.
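The abstract describes generating training data by degrading well-mixed movie audio through separation, adjustment, and remixing. Below is a minimal sketch of what such a pseudo-data step might look like; it is illustrative only and not the authors' actual pipeline. It assumes the separation step (e.g. with an off-the-shelf source-separation model) has already produced per-source stems, and the function name `make_muddy_mix` and the gain range are invented for this example.

```python
import numpy as np

def make_muddy_mix(stems, rng=None, gain_db_range=(-12.0, 6.0)):
    """Simulate a poorly mixed audio clip from separated stems.

    stems: list of equal-length 1-D numpy arrays (e.g. speech, music,
           effects) obtained from a prior separation step.
    Returns a remixed waveform whose per-stem loudness has been randomly
    perturbed, so acoustic saliency no longer matches the video.
    """
    rng = rng or np.random.default_rng()
    mix = np.zeros_like(stems[0])
    for stem in stems:
        # Adjustment: apply a random gain to each stem.
        gain_db = rng.uniform(*gain_db_range)
        mix += stem * (10.0 ** (gain_db / 20.0))
    # Remixing: renormalize to avoid clipping.
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix

# Usage: real stems would come from a separation model; random noise
# stands in for audio here just to show the call.
stems = [np.random.randn(16000) * 0.1 for _ in range(3)]
muddy = make_muddy_mix(stems)
```

The well-mixed original then serves as the supervision target, while the degraded mix and the accompanying video are the model inputs.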
