Open-Vocabulary Audio-Visual Semantic Segmentation

Abstract

Audio-visual semantic segmentation (AVSS) aims to segment and classify sounding objects in videos using acoustic cues. However, most approaches operate under the closed-set assumption and identify only the pre-defined categories present in the training data, so they lack the generalization ability needed to detect novel categories in practical applications. In this paper, we introduce a new task, open-vocabulary audio-visual semantic segmentation, which extends the AVSS task to open-world scenarios beyond the annotated label space. This is a more challenging task that requires recognizing all categories, even those that have never been seen or heard during training. Moreover, we propose the first open-vocabulary AVSS framework, OV-AVSS, which mainly consists of two parts: 1) a universal sound source localization module that performs audio-visual fusion and locates all potential sounding objects, and 2) an open-vocabulary classification module that predicts categories with the help of prior knowledge from large-scale pre-trained vision-language models. To properly evaluate open-vocabulary AVSS, we split the AVSBench-semantic benchmark into zero-shot training and testing subsets, yielding a benchmark we call AVSBench-OV. Extensive experiments demonstrate the strong segmentation and zero-shot generalization ability of our model on all categories. On the AVSBench-OV dataset, OV-AVSS achieves 55.43% mIoU on base categories and 29.14% mIoU on novel categories, exceeding the state-of-the-art zero-shot method by 41.88%/20.61% and the state-of-the-art open-vocabulary method by 10.2%/11.6%. The code is available at https://github.com/ruohaoguo/ovavss.
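
To make the reported metric concrete, the following minimal sketch shows how mIoU can be computed separately for base and novel categories. It is an illustration only, not the evaluation code from the OV-AVSS repository: the function names, the tiny label maps, and the class IDs (0 for background, 1 for a base category, 2 for a novel category) are all hypothetical, and the actual AVSBench-OV protocol may aggregate over frames and videos differently.

```python
from collections import defaultdict


def per_class_iou(pred, target, class_ids):
    """Return {class_id: IoU} for the given classes.

    pred and target are 2D lists of integer class labels with the same shape.
    Classes that appear in neither map are skipped.
    """
    inter = defaultdict(int)
    union = defaultdict(int)
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # Pixel agrees: counts toward intersection and union of that class.
                inter[p] += 1
                union[p] += 1
            else:
                # Pixel disagrees: counts toward the union of both classes.
                union[p] += 1
                union[t] += 1
    return {c: inter[c] / union[c] for c in class_ids if union[c] > 0}


def mean_iou(ious, class_ids):
    """Average the IoU over the classes in class_ids that have a score."""
    values = [ious[c] for c in class_ids if c in ious]
    return sum(values) / len(values) if values else float("nan")


# Hypothetical label IDs: 0 = background, 1 = a base class, 2 = a novel class.
base_classes = [1]
novel_classes = [2]

pred = [[1, 1, 0, 0],
        [1, 2, 2, 0]]
target = [[1, 1, 1, 0],
          [0, 2, 2, 2]]

ious = per_class_iou(pred, target, base_classes + novel_classes)
base_miou = mean_iou(ious, base_classes)
novel_miou = mean_iou(ious, novel_classes)
print(f"base mIoU:  {base_miou:.4f}")   # 0.5000
print(f"novel mIoU: {novel_miou:.4f}")  # 0.6667
```

Reporting the two averages separately, as the abstract does, keeps strong performance on categories seen during training from masking weak performance on categories that appear only at test time.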
