Open-Vocabulary Audio-Visual Semantic Segmentation
July 31, 2024
Authors: Ruohao Guo, Liao Qu, Dantong Niu, Yanyu Qi, Wenzhen Yue, Ji Shi, Bowei Xing, Xianghua Ying
cs.AI
Abstract
Audio-visual semantic segmentation (AVSS) aims to segment and classify
sounding objects in videos with acoustic cues. However, most approaches operate
on the closed-set assumption and only identify pre-defined categories from
training data, lacking the generalization ability to detect novel categories in
practical applications. In this paper, we introduce a new task: open-vocabulary
audio-visual semantic segmentation, extending the AVSS task to open-world scenarios
beyond the annotated label space. This is a more challenging task that requires
recognizing all categories, even those that have never been seen or heard
during training. Moreover, we propose the first open-vocabulary AVSS framework,
OV-AVSS, which mainly consists of two parts: 1) a universal sound source
localization module to perform audio-visual fusion and locate all potential
sounding objects and 2) an open-vocabulary classification module to predict
categories with the help of the prior knowledge from large-scale pre-trained
vision-language models. To properly evaluate open-vocabulary AVSS, we split
the AVSBench-semantic benchmark into zero-shot training and testing subsets,
which we call AVSBench-OV. Extensive experiments demonstrate the strong
segmentation and zero-shot generalization ability of our model on all
categories. On the AVSBench-OV dataset, OV-AVSS achieves 55.43% mIoU on base
categories and 29.14% mIoU on novel categories, exceeding the state-of-the-art
zero-shot method by 41.88%/20.61% and the open-vocabulary method by 10.2%/11.6%
on base/novel categories, respectively.
The code is available at https://github.com/ruohaoguo/ovavss.
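
The abstract reports mIoU separately for base and novel categories. As a rough illustration of what that split means, the sketch below computes per-class IoU on a single toy label map and averages it over two class groups. It is not taken from the OV-AVSS code: the function names, the class IDs, and the choice of which IDs count as "novel" are all hypothetical, and a real evaluation would typically accumulate intersections and unions over the whole test set rather than one frame.

```python
import numpy as np


def per_class_iou(pred, gt, num_classes):
    """IoU for each class ID; NaN where the class is absent from both maps."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        pred_c = pred == c
        gt_c = gt == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union > 0:
            ious[c] = np.logical_and(pred_c, gt_c).sum() / union
    return ious


def split_miou(pred, gt, num_classes, novel_ids):
    """Mean IoU over base classes and over novel classes, ignoring NaNs."""
    ious = per_class_iou(pred, gt, num_classes)
    novel_mask = np.zeros(num_classes, dtype=bool)
    novel_mask[list(novel_ids)] = True
    return np.nanmean(ious[~novel_mask]), np.nanmean(ious[novel_mask])


# Toy 2x4 label maps. Classes 2 and 3 are treated as "novel" (an assumption
# made only for this example; the actual AVSBench-OV split is not shown here).
gt = np.array([[0, 0, 1, 1],
               [2, 2, 3, 3]])
pred = np.array([[0, 0, 1, 0],
                 [2, 3, 3, 1]])

base, novel = split_miou(pred, gt, num_classes=4, novel_ids={2, 3})
print(f"base mIoU: {base:.4f}, novel mIoU: {novel:.4f}")
```

Averaging the two groups separately keeps performance on unseen categories from being masked by the categories seen during training, which is why the abstract reports two numbers rather than one.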