MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large-Audio Language Model
September 25, 2025
Authors: Hsiao-Ying Huang, Yi-Cheng Lin, Hung-yi Lee
cs.AI
Abstract
Large audio-language models (LALMs) show strong zero-shot ability on speech tasks, suggesting promise for speech emotion recognition (SER). However, SER in real-world deployments often fails under domain mismatch, where source data are unavailable and powerful LALMs are accessible only through an API. We ask: given only unlabeled target-domain audio and an API-only LALM, can a student model be adapted to outperform the LALM in the target domain? To this end, we propose MI-Fuse, a denoised label fusion framework that supplements the LALM with a source-domain trained SER classifier as an auxiliary teacher. The framework draws multiple stochastic predictions from both teachers, weights their mean distributions by mutual-information-based uncertainty, and stabilizes training with an exponential moving average teacher. Experiments across three public emotion datasets and six cross-domain transfers show consistent gains, with the student surpassing the LALM and outperforming the strongest baseline by 3.9%. This approach strengthens emotion-aware speech systems without sharing source data, enabling realistic adaptation.
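The abstract does not give the exact fusion rule, but the described pipeline (stochastic predictions per teacher, mutual-information-based uncertainty weighting, EMA teacher) can be illustrated with a minimal sketch. This assumes a BALD-style mutual-information estimate and inverse-uncertainty weighting; the function names (`mutual_information`, `fuse_labels`, `ema_update`) are hypothetical, not from the paper.

```python
import numpy as np

def mutual_information(probs):
    # probs: (n_samples, n_classes) stochastic predictions from one teacher.
    # BALD-style MI: entropy of the mean prediction minus the mean
    # per-sample entropy; higher MI means a less certain teacher.
    mean = probs.mean(axis=0)
    h_mean = -np.sum(mean * np.log(mean + 1e-12))
    h_each = -np.sum(probs * np.log(probs + 1e-12), axis=1).mean()
    return h_mean - h_each

def fuse_labels(lalm_probs, src_probs):
    # Weight each teacher's mean distribution by inverse MI uncertainty,
    # so the more confident teacher dominates the fused pseudo-label.
    # (The paper's actual weighting scheme may differ.)
    teachers = [lalm_probs, src_probs]
    mi = np.array([mutual_information(p) for p in teachers])
    w = 1.0 / (mi + 1e-6)
    w = w / w.sum()
    fused = sum(wi * p.mean(axis=0) for wi, p in zip(w, teachers))
    return fused / fused.sum()

def ema_update(teacher_params, student_params, decay=0.999):
    # Exponential moving average of student weights, stabilizing the
    # teacher used for pseudo-labeling during adaptation.
    return {k: decay * teacher_params[k] + (1 - decay) * student_params[k]
            for k in teacher_params}
```

The student would then be trained on `fuse_labels` outputs as soft targets, with `ema_update` refreshing the teacher copy after each step.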