MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large-Audio Language Model

Abstract

Large audio-language models (LALMs) show strong zero-shot ability on speech tasks, suggesting promise for speech emotion recognition (SER). However, SER in real-world deployments often fails under domain mismatch, where source data are unavailable and powerful LALMs are accessible only through an API. We ask: given only unlabeled target-domain audio and an API-only LALM, can a student model be adapted to outperform the LALM in the target domain? To this end, we propose MI-Fuse, a denoised label fusion framework that supplements the LALM with a source-domain trained SER classifier as an auxiliary teacher. The framework draws multiple stochastic predictions from both teachers, weights their mean distributions by mutual-information-based uncertainty, and stabilizes training with an exponential moving average teacher. Experiments across three public emotion datasets and six cross-domain transfers show consistent gains, with the student surpassing the LALM and outperforming the strongest baseline by 3.9%. This approach strengthens emotion-aware speech systems without sharing source data, enabling realistic adaptation.
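
The fusion step described above can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the abstract does not give the exact formulas, so the `exp(-MI)` weighting, the function names, and the `decay` value are all assumptions made for this example. Mutual information is computed in the usual way for repeated stochastic predictions, as the entropy of the mean distribution minus the mean of the per-sample entropies.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def mean_distribution(samples):
    """Average several stochastic predictions, each a probability vector."""
    n = len(samples)
    return [sum(s[k] for s in samples) / n for k in range(len(samples[0]))]


def mutual_information(samples):
    """Entropy of the mean minus the mean entropy; larger means less consistent."""
    avg_entropy = sum(entropy(s) for s in samples) / len(samples)
    return max(0.0, entropy(mean_distribution(samples)) - avg_entropy)


def fuse(lalm_samples, clf_samples):
    """Fuse the two teachers' mean distributions into one soft label.

    ASSUMPTION: each teacher is weighted by exp(-MI), normalized over the
    two teachers. The abstract only says the weighting is MI-based.
    """
    teachers = (lalm_samples, clf_samples)
    means = [mean_distribution(s) for s in teachers]
    raw = [math.exp(-mutual_information(s)) for s in teachers]
    weights = [r / sum(raw) for r in raw]
    num_classes = len(means[0])
    fused = [sum(w * m[k] for w, m in zip(weights, means)) for k in range(num_classes)]
    return fused, weights


def ema_update(teacher_params, student_params, decay=0.99):
    """Exponential moving average of student parameters (decay is hypothetical)."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher_params, student_params)]
```

Under these assumptions, a teacher whose repeated predictions disagree with each other has higher mutual information and contributes less to the fused label, while the EMA copy of the student changes slowly and so damps the effect of noisy pseudo-labels from step to step.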
