

Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap

October 13, 2025
Authors: KiHyun Nam, Jongmin Choi, Hyeongkeun Lee, Jungwoo Heo, Joon Son Chung
cs.AI

Abstract

Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into the text-embedding distribution. The module is trained on the output embeddings of the frozen multimodal encoder and is implemented as a lightweight network with three residual MLP blocks. To assess the effect of Diffusion-Link on multimodal encoder-LLM coupling, we evaluate on Automatic Audio Captioning (AAC); to our knowledge, this is the first application of diffusion-based modality bridging to AAC. We report two results. (1) Modality-gap analysis: on similarity and geometric criteria, Diffusion-Link reduces the modality gap the most among prior diffusion-based methods and shows a collective migration of audio embeddings toward the text distribution. (2) Downstream AAC: attaching Diffusion-Link to the same multimodal LLM baseline achieves state-of-the-art performance on AudioCaps in both zero-shot and fully supervised captioning without external knowledge, with relative gains of up to 52.5% and 7.5%, respectively. These findings show that closing the modality gap is pivotal for effective coupling between multimodal encoders and LLMs, and that diffusion-based modality bridging offers a promising direction beyond knowledge-retrieval-centric designs. Code will be released upon acceptance at https://github.com/DevKiHyun/Diffusion-Link
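The abstract describes the bridging module only at a high level: a lightweight denoiser of three residual MLP blocks, trained on embeddings from a frozen multimodal encoder, that maps audio embeddings toward the text-embedding distribution. The sketch below illustrates one plausible realization of such a module in PyTorch. The embedding dimension (512), hidden width, sinusoidal timestep embedding, concatenation-based audio conditioning, and noise-prediction objective are all illustrative assumptions, not details confirmed by the paper.

```python
# Minimal sketch of a diffusion-based modality-bridging module (Diffusion-Link-style).
# ASSUMPTIONS: embedding dim 512, concatenation conditioning, epsilon-prediction loss;
# the paper only states "three residual MLP blocks" trained on frozen encoder embeddings.
import math
import torch
import torch.nn as nn


class ResidualMLPBlock(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)  # residual connection around the MLP


def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal embedding of the diffusion timestep."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)


class DiffusionBridge(nn.Module):
    """Predicts the noise added to a text embedding, conditioned on an
    audio embedding taken from the frozen multimodal encoder."""

    def __init__(self, embed_dim: int = 512, hidden: int = 1024, time_dim: int = 128):
        super().__init__()
        # Input: noisy text embedding + audio condition + timestep embedding.
        self.in_proj = nn.Linear(embed_dim * 2 + time_dim, embed_dim)
        self.blocks = nn.ModuleList(
            [ResidualMLPBlock(embed_dim, hidden) for _ in range(3)]
        )
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        self.time_dim = time_dim

    def forward(self, noisy_text: torch.Tensor, audio: torch.Tensor,
                t: torch.Tensor) -> torch.Tensor:
        t_emb = timestep_embedding(t, self.time_dim)
        h = self.in_proj(torch.cat([noisy_text, audio, t_emb], dim=-1))
        for block in self.blocks:
            h = block(h)
        return self.out_proj(h)  # predicted noise (epsilon)


if __name__ == "__main__":
    model = DiffusionBridge()
    audio_emb = torch.randn(4, 512)   # from the frozen multimodal encoder
    text_emb = torch.randn(4, 512)    # paired text embedding (training target)
    t = torch.randint(0, 1000, (4,))
    noise = torch.randn_like(text_emb)
    noisy_text = text_emb + noise      # simplified; a real schedule scales both terms
    pred = model(noisy_text, audio_emb, t)
    loss = nn.functional.mse_loss(pred, noise)  # standard denoising objective
    print(loss.item())
```

At inference, such a module would start from noise (or from the audio embedding itself) and iteratively denoise toward the text-embedding distribution before the result is handed to the LLM; the exact sampler and conditioning scheme used by Diffusion-Link are not specified in the abstract.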