FLAP: Fast Language-Audio Pre-training

Abstract

We propose Fast Language-Audio Pre-training (FLAP), a self-supervised approach that efficiently and effectively learns aligned audio and language representations through masking, contrastive learning, and reconstruction. For efficiency, FLAP randomly drops audio spectrogram tokens and performs self-supervision only on the tokens that remain. Through inter-modal contrastive learning, FLAP learns to align paired audio and text representations in a shared latent space. Notably, FLAP uses multiple augmented views, generated by masking, for inter-modal contrast, and it learns to reconstruct the masked portion of the audio tokens. In addition, FLAP uses large language models (LLMs) to augment the text inputs, which further improves performance. Together, these techniques yield more robust and informative audio-text representations, enabling FLAP to achieve state-of-the-art (SoTA) performance on audio-text retrieval on AudioCaps (53.0% R@1) and Clotho (25.5% R@1).
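As a rough illustration of two of the ingredients named in the abstract, the following sketch shows random token dropping and a symmetric inter-modal contrastive loss in plain Python. It is not the authors' implementation: the function names (`drop_tokens`, `info_nce`), the `keep_ratio` parameter, the InfoNCE form of the loss, and the temperature of 0.07 are all assumptions made for illustration, and the reconstruction objective and the LLM-based text augmentation are left out.

```python
import math
import random


def drop_tokens(tokens, keep_ratio, rng):
    """Randomly keep a fraction of tokens, preserving their original order.

    Hypothetical helper: the abstract only says tokens are dropped at random.
    """
    n_keep = max(1, round(len(tokens) * keep_ratio))
    kept = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept], kept


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Assumption: row i of audio_emb is paired with row i of text_emb.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(a)
    # logits[i][j]: temperature-scaled cosine similarity of audio i and text j.
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(rows):
        # The target for row i is column i, i.e. the true pair.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Transposing the logits gives the text-to-audio direction.
    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


if __name__ == "__main__":
    rng = random.Random(0)
    patches = list(range(16))  # stand-ins for spectrogram patch tokens
    visible, kept_idx = drop_tokens(patches, keep_ratio=0.25, rng=rng)
    print(len(visible), "of", len(patches), "tokens kept")
    print(info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

Because the encoder only sees the kept tokens, its cost in a setup like this shrinks roughly with `keep_ratio`. Calling `drop_tokens` several times on the same clip would give different masked views, which is one way to read the "multiple augmented views" mentioned in the abstract.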
