Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
June 9, 2024
Authors: Mark Hamilton, Andrew Zisserman, John R. Hershey, William T. Freeman
cs.AI
Abstract
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision. Furthermore, it automatically discovers and distinguishes between these two types of associations without supervision. We show that DenseAV's localization abilities arise from a new multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. In contrast, many other systems that learn "global" audio and video representations cannot localize words and sounds. Finally, we contribute two new datasets to improve the evaluation of AV representations through speech- and sound-prompted semantic segmentation. On these and other datasets, we show that DenseAV dramatically outperforms the prior art on speech- and sound-prompted semantic segmentation. DenseAV also outperforms the previous state-of-the-art, ImageBind, on cross-modal retrieval while using fewer than half the parameters. Project Page: https://aka.ms/denseav
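To make the abstract's "multi-head feature aggregation operator" concrete, below is a minimal, hypothetical PyTorch sketch, not the authors' implementation. It assumes per-time-step audio features and per-patch image features, splits channels into heads, forms a dense audio-visual similarity volume from inner products, and pools it (max over image patches, mean over time, sum over heads) into one clip-level score that could serve as a contrastive logit. All names (multihead_av_similarity, num_heads) are illustrative, and the exact pooling choices are assumptions rather than the paper's specification.

    import torch

    def multihead_av_similarity(audio_feats: torch.Tensor,
                                image_feats: torch.Tensor,
                                num_heads: int = 2) -> torch.Tensor:
        """Sketch of a multi-head dense audio-visual similarity aggregator.

        audio_feats: (B, T, C)    per-time-step audio features
        image_feats: (B, H, W, C) per-patch image features
        Returns a (B,) clip-level similarity score per audio/image pair.
        """
        B, T, C = audio_feats.shape
        _, H, W, _ = image_feats.shape
        # Split the channel dimension into heads.
        a = audio_feats.reshape(B, T, num_heads, C // num_heads)
        v = image_feats.reshape(B, H * W, num_heads, C // num_heads)
        # Dense similarity volume: one score per (time step, head, image patch).
        sim = torch.einsum('bthc,bphc->bthp', a, v)            # (B, T, heads, H*W)
        # Pool: max over image patches, mean over time, sum over heads (assumed).
        return sim.max(dim=-1).values.mean(dim=1).sum(dim=-1)  # (B,)

    # Example with random features standing in for DenseAV-style encoder outputs.
    audio = torch.randn(4, 50, 512)      # 4 clips, 50 audio time steps
    image = torch.randn(4, 14, 14, 512)  # 14x14 grid of image patch features
    print(multihead_av_similarity(audio, image).shape)  # torch.Size([4])

Because the score is built from a dense per-patch, per-time-step volume before pooling, the intermediate volume itself can be read out as a localization map, which is the intuition behind the localization ability described in the abstract.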