MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large-Audio Language Model

Abstract

Large audio-language models (LALMs) show strong zero-shot ability on speech tasks, suggesting promise for speech emotion recognition (SER). However, SER in real-world deployments often fails under domain mismatch, particularly when source data are unavailable and powerful LALMs are accessible only through an API. We ask: given only unlabeled target-domain audio and an API-only LALM, can a student model be adapted to outperform the LALM in the target domain? To this end, we propose MI-Fuse, a denoised label fusion framework that supplements the LALM with a source-domain-trained SER classifier as an auxiliary teacher. The framework draws multiple stochastic predictions from both teachers, weights their mean distributions by mutual-information-based uncertainty, and stabilizes training with an exponential moving average teacher. Experiments across three public emotion datasets and six cross-domain transfers show consistent gains, with the student surpassing the LALM and outperforming the strongest baseline by 3.9%. This approach strengthens emotion-aware speech systems without sharing source data, enabling realistic adaptation.
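The fusion step described above can be illustrated with a minimal sketch. The abstract does not specify the exact weighting rule or hyperparameters, so everything below is an assumption made for illustration: mutual information is estimated as the entropy of the mean prediction minus the mean per-sample entropy, each teacher is weighted by exp(-MI) normalized across the two teachers, and the function names (`mutual_information`, `fuse_labels`, `ema_update`) and the `decay` value are hypothetical rather than taken from the paper.

```python
import numpy as np


def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy (nats) of probability vectors along `axis`."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p), axis=axis)


def mutual_information(samples):
    """MI-style uncertainty from K stochastic predictions of shape (K, C).

    Computed as H(mean prediction) - mean(H(each prediction)); it is near
    zero when the K predictions agree and grows when they disagree.
    """
    mean = samples.mean(axis=0)
    return entropy(mean) - entropy(samples, axis=-1).mean()


def fuse_labels(lalm_samples, clf_samples):
    """Fuse two teachers' mean distributions into one soft label.

    ASSUMPTION: weights are proportional to exp(-MI); the paper's actual
    weighting rule may differ.
    """
    means = np.stack([lalm_samples.mean(axis=0), clf_samples.mean(axis=0)])
    mi = np.array([mutual_information(lalm_samples),
                   mutual_information(clf_samples)])
    weights = np.exp(-mi)
    weights = weights / weights.sum()
    fused = weights @ means          # convex combination of the two means
    return fused / fused.sum(), weights


def ema_update(teacher_params, student_params, decay=0.99):
    """Exponential moving average of student parameters (hypothetical decay)."""
    return {name: decay * teacher_params[name] + (1.0 - decay) * student_params[name]
            for name in teacher_params}


# Usage: the LALM's three samples agree, the classifier's three disagree.
lalm = np.array([[0.7, 0.2, 0.1]] * 3)
clf = np.array([[0.90, 0.05, 0.05],
                [0.05, 0.90, 0.05],
                [0.05, 0.05, 0.90]])
soft_label, teacher_weights = fuse_labels(lalm, clf)
```

In this toy case the teacher whose samples agree with one another receives the larger weight, which is the intuition behind down-weighting uncertain teachers. A student would then be trained on `soft_label`, with `ema_update` applied after each step to maintain the averaged teacher.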
