Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

Abstract

Despite the growing popularity of Multimodal Domain Generalization (MMDG) for enhancing model robustness, it remains unclear whether reported performance gains reflect genuine algorithmic progress or are artifacts of inconsistent evaluation protocols. Current research is fragmented, with studies varying significantly across datasets, modality configurations, and experimental settings. Furthermore, existing benchmarks focus predominantly on action recognition, often neglecting critical real-world challenges such as input corruptions, missing modalities, and model trustworthiness. This lack of standardization makes it difficult to assess the field's progress reliably. To address this issue, we introduce MMDG-Bench, the first unified and comprehensive benchmark for MMDG, which standardizes evaluation across six datasets spanning three diverse tasks: action recognition, mechanical fault diagnosis, and sentiment analysis. MMDG-Bench encompasses six modality combinations, nine representative methods, and multiple evaluation settings. Beyond standard accuracy, it systematically assesses corruption robustness, missing-modality generalization, misclassification detection, and out-of-distribution detection. With 7,402 neural networks trained in total across 95 unique cross-domain tasks, MMDG-Bench yields five key findings: (1) under fair comparisons, recent specialized MMDG methods offer only marginal improvements over the empirical risk minimization (ERM) baseline; (2) no single method consistently outperforms the others across all datasets or modality combinations; (3) a substantial gap to upper-bound performance persists, indicating that MMDG remains far from solved; (4) trimodal fusion does not consistently outperform the strongest bimodal configurations; and (5) all evaluated methods degrade significantly under corruption and missing-modality scenarios, with some methods further compromising model trustworthiness.
