MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large-Audio Language Model
September 25, 2025
Authors: Hsiao-Ying Huang, Yi-Cheng Lin, Hung-yi Lee
cs.AI
Abstract
Large audio-language models (LALMs) show strong zero-shot ability on speech
tasks, suggesting promise for speech emotion recognition (SER). However, SER in
real-world deployments often fails under domain mismatch, where source data are
unavailable and powerful LALMs are accessible only through an API. We ask:
given only unlabeled target-domain audio and an API-only LALM, can a student
model be adapted to outperform the LALM in the target domain? To this end, we
propose MI-Fuse, a denoised label fusion framework that supplements the LALM
with a source-domain-trained SER classifier as an auxiliary teacher. The
framework draws multiple stochastic predictions from both teachers, weights
their mean distributions by mutual-information-based uncertainty, and
stabilizes training with an exponential moving average teacher. Experiments
across three public emotion datasets and six cross-domain transfers show
consistent gains, with the student surpassing the LALM and outperforming the
strongest baseline by 3.9%. This approach strengthens emotion-aware speech
systems without sharing source data, enabling realistic adaptation.
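The fusion step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each teacher yields K stochastic softmax predictions over C emotion classes (e.g. via MC dropout or repeated LALM queries), that "mutual-information-based uncertainty" is the BALD-style quantity H(mean prediction) − mean(H(prediction)), and that teachers are weighted by inverse mutual information. All function names are illustrative.

```python
# Hypothetical sketch of MI-weighted label fusion and the EMA-teacher update.
# Assumptions (not from the paper): inverse-MI weighting, BALD-style MI.
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy of a categorical distribution."""
    return -np.sum(p * np.log(p + eps), axis=axis)

def mutual_information(preds):
    """BALD-style mutual information: H(mean of preds) - mean entropy.
    preds: array of shape (K, C), K stochastic softmax predictions."""
    mean_p = preds.mean(axis=0)
    return entropy(mean_p) - entropy(preds, axis=-1).mean()

def mi_fuse(preds_lalm, preds_src, eps=1e-12):
    """Fuse the two teachers' mean distributions, weighting each teacher
    by the inverse of its mutual-information uncertainty (assumed scheme)."""
    means = [preds_lalm.mean(axis=0), preds_src.mean(axis=0)]
    mis = [mutual_information(preds_lalm), mutual_information(preds_src)]
    weights = np.array([1.0 / (mi + eps) for mi in mis])
    weights /= weights.sum()
    fused = weights[0] * means[0] + weights[1] * means[1]
    return fused / fused.sum()  # renormalize against numerical drift

def ema_update(ema_params, student_params, decay=0.999):
    """Exponential-moving-average teacher update used to stabilize training."""
    return {k: decay * ema_params[k] + (1 - decay) * student_params[k]
            for k in ema_params}
```

Under this weighting, a teacher whose stochastic predictions agree with each other (low mutual information) dominates the fused pseudo-label, while a teacher whose samples disagree is down-weighted; the student is then trained on the fused distribution while the EMA copy supplies a slowly moving target.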