
Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap

October 13, 2025
Authors: KiHyun Nam, Jongmin Choi, Hyeongkeun Lee, Jungwoo Heo, Joon Son Chung
cs.AI

Abstract

Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into the text-embedding distribution. The module is trained on the output embeddings of the frozen multimodal encoder and implemented as a lightweight network with three residual MLP blocks. To assess the effect of Diffusion-Link on multimodal encoder-LLM coupling, we evaluate on Automatic Audio Captioning (AAC); to our knowledge, this is the first application of diffusion-based modality bridging to AAC. We report two results. (1) Modality-gap analysis: on similarity and geometric criteria, Diffusion-Link reduces the modality gap the most among prior diffusion-based methods and shows a collective migration of audio embeddings toward the text distribution. (2) Downstream AAC: attaching Diffusion-Link to the same multimodal LLM baseline achieves state-of-the-art results on AudioCaps in both zero-shot and fully supervised captioning without external knowledge, with relative gains of up to 52.5% and 7.5%, respectively. These findings show that closing the modality gap is pivotal for effective coupling between multimodal encoders and LLMs, and diffusion-based modality bridging offers a promising direction beyond knowledge-retrieval-centric designs. Code will be released upon acceptance at https://github.com/DevKiHyun/Diffusion-Link.
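
The abstract describes the bridging module only at a high level: a denoiser built from three residual MLP blocks that operates on the frozen encoder's output embeddings and maps audio embeddings toward the text-embedding distribution. The sketch below is a minimal, illustrative PyTorch interpretation of that description, not the authors' released code; the embedding dimension, timestep conditioning, noise schedule, and all class and variable names are assumptions made for illustration.

```python
# Illustrative sketch only (assumed design, not the paper's implementation):
# a denoiser with three residual MLP blocks that predicts the noise added to a
# text embedding, conditioned on the paired audio embedding and the timestep.
import torch
import torch.nn as nn


class ResidualMLPBlock(nn.Module):
    """One residual MLP block: LayerNorm -> Linear -> GELU -> Linear, plus skip."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)


class DiffusionBridge(nn.Module):
    """Lightweight denoiser over frozen-encoder embeddings (assumed interface)."""

    def __init__(self, embed_dim: int = 512, hidden: int = 1024, num_steps: int = 1000):
        super().__init__()
        self.time_embed = nn.Embedding(num_steps, embed_dim)
        # Input is the noisy text embedding, the audio condition, and the timestep embedding.
        self.in_proj = nn.Linear(embed_dim * 3, embed_dim)
        self.blocks = nn.Sequential(*[ResidualMLPBlock(embed_dim, hidden) for _ in range(3)])
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, noisy_text, audio_cond, t):
        h = torch.cat([noisy_text, audio_cond, self.time_embed(t)], dim=-1)
        return self.out_proj(self.blocks(self.in_proj(h)))


# Toy training step with a simple DDPM-style linear noise schedule (assumed).
num_steps, dim = 1000, 512
betas = torch.linspace(1e-4, 2e-2, num_steps)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

model = DiffusionBridge(embed_dim=dim, num_steps=num_steps)
audio_emb = torch.randn(8, dim)  # stand-in for frozen-encoder audio embeddings
text_emb = torch.randn(8, dim)   # stand-in for paired text embeddings

t = torch.randint(0, num_steps, (8,))
noise = torch.randn_like(text_emb)
a_bar = alphas_bar[t].unsqueeze(-1)
noisy_text = a_bar.sqrt() * text_emb + (1 - a_bar).sqrt() * noise

# Standard epsilon-prediction objective; at inference the reverse process would
# start from the audio embedding (or noise) and iteratively denoise toward the
# text-embedding distribution.
loss = nn.functional.mse_loss(model(noisy_text, audio_emb, t), noise)
loss.backward()
```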