MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large-Audio Language Model
September 25, 2025
Authors: Hsiao-Ying Huang, Yi-Cheng Lin, Hung-yi Lee
cs.AI
Abstract
Large audio-language models (LALMs) show strong zero-shot ability on speech tasks, suggesting promise for speech emotion recognition (SER). However, SER in real-world deployments often fails under domain mismatch, where source data are unavailable and powerful LALMs are accessible only through an API. We ask: given only unlabeled target-domain audio and an API-only LALM, can a student model be adapted to outperform the LALM in the target domain? To this end, we propose MI-Fuse, a denoised label fusion framework that supplements the LALM with a source-domain trained SER classifier as an auxiliary teacher. The framework draws multiple stochastic predictions from both teachers, weights their mean distributions by mutual-information-based uncertainty, and stabilizes training with an exponential moving average teacher. Experiments across three public emotion datasets and six cross-domain transfers show consistent gains, with the student surpassing the LALM and outperforming the strongest baseline by 3.9%. This approach strengthens emotion-aware speech systems without sharing source data, enabling realistic adaptation.
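The abstract does not give the exact fusion rule, but the described pipeline (stochastic predictions per teacher, mutual-information-based uncertainty weighting, EMA teacher) can be illustrated with a minimal sketch. This assumes a BALD-style mutual-information estimate and inverse-uncertainty weighting; the function names (`mutual_information`, `fuse_labels`, `ema_update`) are hypothetical, not from the paper.

```python
import numpy as np

def mutual_information(probs):
    # probs: (n_samples, n_classes) stochastic predictions from one teacher.
    # BALD-style MI: entropy of the mean prediction minus the mean
    # per-sample entropy; higher MI means a less certain teacher.
    mean = probs.mean(axis=0)
    h_mean = -np.sum(mean * np.log(mean + 1e-12))
    h_each = -np.sum(probs * np.log(probs + 1e-12), axis=1).mean()
    return h_mean - h_each

def fuse_labels(lalm_probs, src_probs):
    # Weight each teacher's mean distribution by inverse MI uncertainty,
    # so the more confident teacher dominates the fused pseudo-label.
    # (The paper's actual weighting scheme may differ.)
    teachers = [lalm_probs, src_probs]
    mi = np.array([mutual_information(p) for p in teachers])
    w = 1.0 / (mi + 1e-6)
    w = w / w.sum()
    fused = sum(wi * p.mean(axis=0) for wi, p in zip(w, teachers))
    return fused / fused.sum()

def ema_update(teacher_params, student_params, decay=0.999):
    # Exponential moving average of student weights, stabilizing the
    # teacher used for pseudo-labeling during adaptation.
    return {k: decay * teacher_params[k] + (1 - decay) * student_params[k]
            for k in teacher_params}
```

The student would then be trained on `fuse_labels` outputs as soft targets, with `ema_update` refreshing the teacher copy after each step.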