
FLAP: Fast Language-Audio Pre-training

November 2, 2023
Authors: Ching-Feng Yeh, Po-Yao Huang, Vasu Sharma, Shang-Wen Li, Gargi Ghosh
cs.AI

Abstract

We propose Fast Language-Audio Pre-training (FLAP), a self-supervised approach that efficiently and effectively learns aligned audio and language representations through masking, contrastive learning and reconstruction. For efficiency, FLAP randomly drops audio spectrogram tokens, focusing solely on the remaining ones for self-supervision. Through inter-modal contrastive learning, FLAP learns to align paired audio and text representations in a shared latent space. Notably, FLAP leverages multiple augmented views via masking for inter-modal contrast and learns to reconstruct the masked portion of audio tokens. Moreover, FLAP leverages large language models (LLMs) to augment the text inputs, contributing to improved performance. These approaches lead to more robust and informative audio-text representations, enabling FLAP to achieve state-of-the-art (SoTA) performance on audio-text retrieval tasks on AudioCaps (achieving 53.0% R@1) and Clotho (achieving 25.5% R@1).
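The two core ingredients described above, randomly dropping spectrogram tokens for efficiency and aligning paired audio/text embeddings with inter-modal contrastive learning, can be sketched as follows. This is a minimal illustration assuming NumPy and toy embeddings, not the authors' implementation; `drop_tokens`, `info_nce`, and all shapes and hyperparameters here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_tokens(tokens, keep_ratio=0.5):
    """Randomly keep a subset of audio spectrogram tokens (efficiency masking).
    tokens: (num_tokens, dim). Hypothetical helper, not FLAP's actual code."""
    n = tokens.shape[0]
    keep = rng.choice(n, size=max(1, int(n * keep_ratio)), replace=False)
    return tokens[np.sort(keep)]

def info_nce(audio_emb, text_emb, temperature=0.07):
    """Symmetric inter-modal contrastive (InfoNCE/CLIP-style) loss over a batch
    of paired audio/text embeddings, each of shape (batch, dim)."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature            # (batch, batch) similarity matrix
    labels = np.arange(len(a))                # matched pairs lie on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average of audio-to-text and text-to-audio cross-entropies.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Dropping, say, half of the tokens both cuts the encoder's cost and, because different random subsets survive each time, yields the multiple augmented views per clip that the abstract mentions for inter-modal contrast.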