FLAP: Fast Language-Audio Pre-training

Abstract

We propose Fast Language-Audio Pre-training (FLAP), a self-supervised approach that efficiently and effectively learns aligned audio and language representations through masking, contrastive learning, and reconstruction. For efficiency, FLAP randomly drops audio spectrogram tokens and performs self-supervision only on the tokens that remain. Through inter-modal contrastive learning, FLAP learns to align paired audio and text representations in a shared latent space. Notably, FLAP uses multiple augmented views generated by masking for the inter-modal contrast, and it learns to reconstruct the masked portion of the audio tokens. FLAP also uses large language models (LLMs) to augment the text inputs, which further improves performance. Together, these techniques yield more robust and informative audio-text representations, enabling FLAP to achieve state-of-the-art (SoTA) performance on audio-text retrieval, with 53.0% R@1 on AudioCaps and 25.5% R@1 on Clotho.
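
The two core ingredients named in the abstract, random token dropping and inter-modal contrastive learning, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mean pooling, the keep ratio of 0.25, the temperature of 0.07, and the random stand-in embeddings are all assumptions chosen for illustration, and the reconstruction objective and LLM-based text augmentation are omitted.

```python
import math
import random


def drop_tokens(tokens, keep_ratio, rng):
    """Keep a random subset of tokens, preserving their original order."""
    n_keep = max(1, round(len(tokens) * keep_ratio))
    kept = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept], kept


def mean_pool(tokens):
    """Average token vectors into one clip-level embedding (assumed pooling)."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings."""
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    # logits[i][j] is the scaled similarity between audio i and text j.
    logits = [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
              for ai in a]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy batch: 4 clips, each with 16 tokens of dimension 8.
# Random vectors stand in for spectrogram patch embeddings.
rng = random.Random(0)
batch = [[[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
         for _ in range(4)]

# Audio side: encode only the 25% of tokens that survive dropping.
audio_embs = [mean_pool(drop_tokens(clip, 0.25, rng)[0]) for clip in batch]
# Text side: hypothetical stand-in for the output of a text encoder.
text_embs = [mean_pool(clip) for clip in batch]

loss = contrastive_loss(audio_embs, text_embs)
```

Because only the kept tokens are passed on, the cost of encoding the audio shrinks with the keep ratio, which is where the efficiency gain described above comes from. Calling `drop_tokens` more than once on the same clip yields different masked views, which is one way to picture the multiple augmented views used for the inter-modal contrast.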
