DIFFA-2: A Practical Diffusion Large Language Model for General Audio Understanding

January 30, 2026
Authors: Jiaming Zhou, Xuxin Cheng, Shiwan Zhao, Yuhang Jia, Cao Liu, Ke Zeng, Xunliang Cai, Yong Qin
cs.AI

Abstract

Autoregressive (AR) large audio language models (LALMs) such as Qwen-2.5-Omni have achieved strong performance on audio understanding and interaction, but scaling them remains costly in data and computation, and strictly sequential decoding limits inference efficiency. Diffusion large language models (dLLMs) have recently been shown to make effective use of limited training data, and prior work on DIFFA indicates that replacing an AR backbone with a diffusion counterpart can substantially improve audio understanding under matched settings. That result, however, was obtained only at proof-of-concept scale, without large-scale instruction tuning, preference alignment, or practical decoding schemes. We introduce DIFFA-2, a practical diffusion-based LALM for general audio understanding. DIFFA-2 upgrades the speech encoder, employs dual semantic and acoustic adapters, and is trained with a four-stage curriculum that combines semantic and acoustic alignment, large-scale supervised fine-tuning, and variance-reduced preference optimization, using only fully open-source corpora. Experiments on MMSU, MMAU, and MMAR show that DIFFA-2 consistently improves over DIFFA and is competitive with strong AR LALMs under practical training budgets, supporting the view that diffusion-based modeling is a viable backbone for large-scale audio understanding. Our code is available at https://github.com/NKU-HLT/DIFFA.git.