MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large-Audio Language Model
September 25, 2025
Authors: Hsiao-Ying Huang, Yi-Cheng Lin, Hung-yi Lee
cs.AI
Abstract
Large audio-language models (LALMs) show strong zero-shot ability on speech
tasks, suggesting promise for speech emotion recognition (SER). However, SER in
real-world deployments often fails under domain mismatch, where source data are
unavailable and powerful LALMs are accessible only through an API. We ask:
given only unlabeled target-domain audio and an API-only LALM, can a student
model be adapted to outperform the LALM in the target domain? To this end, we
propose MI-Fuse, a denoised label fusion framework that supplements the LALM
with a source-domain-trained SER classifier as an auxiliary teacher. The
framework draws multiple stochastic predictions from both teachers, weights
their mean distributions by mutual-information-based uncertainty, and
stabilizes training with an exponential moving average teacher. Experiments
across three public emotion datasets and six cross-domain transfers show
consistent gains, with the student surpassing the LALM and outperforming the
strongest baseline by 3.9%. This approach strengthens emotion-aware speech
systems without sharing source data, enabling realistic adaptation.
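The fusion step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each teacher yields K stochastic softmax predictions over C emotion classes (e.g. via MC dropout or repeated LALM queries), that "mutual-information-based uncertainty" is the BALD-style quantity H(mean prediction) − mean(H(prediction)), and that teachers are weighted by inverse mutual information. All function names are illustrative.

```python
# Hypothetical sketch of MI-weighted label fusion and the EMA-teacher update.
# Assumptions (not from the paper): inverse-MI weighting, BALD-style MI.
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy of a categorical distribution."""
    return -np.sum(p * np.log(p + eps), axis=axis)

def mutual_information(preds):
    """BALD-style mutual information: H(mean of preds) - mean entropy.
    preds: array of shape (K, C), K stochastic softmax predictions."""
    mean_p = preds.mean(axis=0)
    return entropy(mean_p) - entropy(preds, axis=-1).mean()

def mi_fuse(preds_lalm, preds_src, eps=1e-12):
    """Fuse the two teachers' mean distributions, weighting each teacher
    by the inverse of its mutual-information uncertainty (assumed scheme)."""
    means = [preds_lalm.mean(axis=0), preds_src.mean(axis=0)]
    mis = [mutual_information(preds_lalm), mutual_information(preds_src)]
    weights = np.array([1.0 / (mi + eps) for mi in mis])
    weights /= weights.sum()
    fused = weights[0] * means[0] + weights[1] * means[1]
    return fused / fused.sum()  # renormalize against numerical drift

def ema_update(ema_params, student_params, decay=0.999):
    """Exponential-moving-average teacher update used to stabilize training."""
    return {k: decay * ema_params[k] + (1 - decay) * student_params[k]
            for k in ema_params}
```

Under this weighting, a teacher whose stochastic predictions agree with each other (low mutual information) dominates the fused pseudo-label, while a teacher whose samples disagree is down-weighted; the student is then trained on the fused distribution while the EMA copy supplies a slowly moving target.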