Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language

Abstract

We present DenseAV, a novel dual-encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely by watching videos. We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision. Furthermore, it automatically discovers and distinguishes between these two types of associations without supervision. We show that DenseAV's localization abilities arise from a new multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. In contrast, many other systems that learn "global" audio and video representations cannot localize words or sounds. Finally, we contribute two new datasets to improve the evaluation of AV representations through speech- and sound-prompted semantic segmentation. On these and other datasets, we show that DenseAV dramatically outperforms the prior art on speech- and sound-prompted semantic segmentation. DenseAV also outperforms the previous state of the art, ImageBind, on cross-modal retrieval while using fewer than half as many parameters. Project page: https://aka.ms/denseav
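
To make the idea of comparing dense audio and image features for contrastive learning more concrete, the following is a minimal pure-Python sketch. It is not the DenseAV implementation. The pooling rule (maximum over heads and image locations, then mean over audio time steps), the one-directional InfoNCE-style loss, the tensor layout, and all names (`aggregate_similarity`, `info_nce`, `clips`) are assumptions made for illustration only; the abstract does not specify them.

```python
import math

def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def aggregate_similarity(audio, visual):
    """Pool a dense audio-visual similarity volume into one score.

    audio[k][t]  : feature vector of head k at audio time step t
    visual[k][p] : feature vector of head k at image location p
    Assumed pooling (illustrative): for each time step take the best
    match over all heads and locations, then average over time.
    """
    num_heads = len(audio)
    per_step = []
    for t in range(len(audio[0])):
        best = max(
            dot(audio[k][t], visual[k][p])
            for k in range(num_heads)
            for p in range(len(visual[k]))
        )
        per_step.append(best)
    return sum(per_step) / len(per_step)

def info_nce(sim, temperature=1.0):
    """Audio-to-visual InfoNCE loss; matched pairs lie on the diagonal."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)

# Toy batch: 2 clips, 2 heads, 2 time steps, 2 image locations, 2 channels.
# Clip 0 is only active in head 0, clip 1 only in head 1.
zeros = [[0, 0], [0, 0]]
clips = [
    {"audio": [[[1, 0], [1, 0]], zeros], "visual": [[[1, 0], [0, 0]], zeros]},
    {"audio": [zeros, [[0, 1], [0, 1]]], "visual": [zeros, [[0, 0], [0, 1]]]},
]

# Pairwise clip-level similarities from the dense features.
sim = [
    [aggregate_similarity(a["audio"], v["visual"]) for v in clips]
    for a in clips
]
loss = info_nce(sim)
```

In this toy batch each clip matches only its own frame, so the similarity matrix is the identity and the loss falls below the chance level of log 2. Because the clip-level score is built from per-location comparisons, the same similarity volume can in principle be inspected before pooling to see where in the image a sound or word matches.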
