

Open-Vocabulary Audio-Visual Semantic Segmentation

July 31, 2024
作者: Ruohao Guo, Liao Qu, Dantong Niu, Yanyu Qi, Wenzhen Yue, Ji Shi, Bowei Xing, Xianghua Ying
cs.AI

Abstract

Audio-visual semantic segmentation (AVSS) aims to segment and classify sounding objects in videos with acoustic cues. However, most approaches operate on the closed-set assumption and only identify pre-defined categories from training data, lacking the generalization ability to detect novel categories in practical applications. In this paper, we introduce a new task: open-vocabulary audio-visual semantic segmentation, extending the AVSS task to open-world scenarios beyond the annotated label space. This is a more challenging task that requires recognizing all categories, even those that have never been seen nor heard during training. Moreover, we propose the first open-vocabulary AVSS framework, OV-AVSS, which mainly consists of two parts: 1) a universal sound source localization module to perform audio-visual fusion and locate all potential sounding objects and 2) an open-vocabulary classification module to predict categories with the help of prior knowledge from large-scale pre-trained vision-language models. To properly evaluate open-vocabulary AVSS, we split zero-shot training and testing subsets based on the AVSBench-semantic benchmark, namely AVSBench-OV. Extensive experiments demonstrate the strong segmentation and zero-shot generalization ability of our model on all categories. On the AVSBench-OV dataset, OV-AVSS achieves 55.43% mIoU on base categories and 29.14% mIoU on novel categories, exceeding the state-of-the-art zero-shot method by 41.88%/20.61% and the open-vocabulary method by 10.2%/11.6%. The code is available at https://github.com/ruohaoguo/ovavss.
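The open-vocabulary classification idea described in the abstract can be sketched as follows: object/query embeddings are compared against text embeddings of arbitrary category names in a shared vision-language space, so novel categories need only a new text prompt rather than retraining. The sketch below is a minimal illustration with random placeholder embeddings standing in for a pre-trained vision-language model's encoders; the function name `classify_queries` and all data are hypothetical, not the authors' implementation.

```python
import numpy as np

def classify_queries(query_embs, text_embs, class_names):
    """Assign each query embedding the category whose text embedding
    is most similar under cosine similarity."""
    # L2-normalize both sides so the dot product equals cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = q @ t.T                      # shape: (num_queries, num_classes)
    best = sims.argmax(axis=1)
    return [class_names[i] for i in best]

# Toy example: 3 sounding-object queries, 4 candidate category names.
# In a real system, text_embs would come from a pre-trained
# vision-language model's text encoder (e.g. a CLIP-style model),
# and query_embs from the segmentation model's mask queries.
rng = np.random.default_rng(0)
class_names = ["dog", "piano", "helicopter", "violin"]
text_embs = rng.normal(size=(4, 16))
# Simulate queries that lie near particular class embeddings plus noise.
query_embs = text_embs[[2, 0, 3]] + 0.1 * rng.normal(size=(3, 16))
print(classify_queries(query_embs, text_embs, class_names))
```

Because the class vocabulary is just a list of strings embedded at inference time, extending the model to unseen categories amounts to appending names to `class_names`, which is what makes the classification open-vocabulary.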

