
Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

May 7, 2026
Authors: Hao Dong, Hongzhao Li, Shupan Li, Muhammad Haris Khan, Eleni Chatzi, Olga Fink
cs.AI

Abstract

Despite the growing popularity of Multimodal Domain Generalization (MMDG) for enhancing model robustness, it remains unclear whether reported performance gains reflect genuine algorithmic progress or are artifacts of inconsistent evaluation protocols. Current research is fragmented, with studies varying significantly across datasets, modality configurations, and experimental settings. Furthermore, existing benchmarks focus predominantly on action recognition, often neglecting critical real-world challenges such as input corruptions, missing modalities, and model trustworthiness. This lack of standardization precludes reliable assessment of the field's progress. To address this issue, we introduce MMDG-Bench, the first unified and comprehensive benchmark for MMDG, which standardizes evaluation across six datasets spanning three diverse tasks: action recognition, mechanical fault diagnosis, and sentiment analysis. MMDG-Bench encompasses six modality combinations, nine representative methods, and multiple evaluation settings. Beyond standard accuracy, it systematically assesses corruption robustness, missing-modality generalization, misclassification detection, and out-of-distribution detection. With 7,402 neural networks trained in total across 95 unique cross-domain tasks, MMDG-Bench yields five key findings: (1) under fair comparisons, recent specialized MMDG methods offer only marginal improvements over the ERM baseline; (2) no single method consistently outperforms others across datasets or modality combinations; (3) a substantial gap to upper-bound performance persists, indicating that MMDG remains far from solved; (4) trimodal fusion does not consistently outperform the strongest bimodal configurations; and (5) all evaluated methods exhibit significant degradation under corruption and missing-modality scenarios, with some methods further compromising model trustworthiness.
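For context on finding (1), the ERM baseline is plain Empirical Risk Minimization: pool all source domains and minimize average classification loss, with no domain-specific machinery. The sketch below illustrates multimodal ERM under a leave-one-domain-out protocol of the kind MMDG evaluations use. Everything here is a hypothetical illustration, not the benchmark's actual code: the synthetic domains D1-D3, the two feature dimensions standing in for pre-extracted modality features, and the LateFusionERM model are all assumptions made for the example.

```python
# Minimal sketch (assumed, not MMDG-Bench code): multimodal ERM with
# late fusion, trained on pooled source domains, tested on a held-out one.
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset


class LateFusionERM(nn.Module):
    """One encoder per modality; features are concatenated and classified."""

    def __init__(self, dims, num_classes, hidden=64):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in dims
        )
        self.head = nn.Linear(hidden * len(dims), num_classes)

    def forward(self, modalities):
        feats = [enc(x) for enc, x in zip(self.encoders, modalities)]
        return self.head(torch.cat(feats, dim=-1))


def make_domain(n=256, dims=(32, 48), num_classes=5, seed=0):
    # Synthetic stand-in for one domain's pre-extracted modality features;
    # real benchmarks use genuinely distribution-shifted domains.
    g = torch.Generator().manual_seed(seed)
    xs = [torch.randn(n, d, generator=g) for d in dims]
    y = torch.randint(0, num_classes, (n,), generator=g)
    return TensorDataset(*xs, y)


domains = {name: make_domain(seed=i) for i, name in enumerate(["D1", "D2", "D3"])}
held_out = "D3"  # leave one domain out as the unseen target

# ERM: pool every source domain and minimize average cross-entropy.
source = ConcatDataset(ds for name, ds in domains.items() if name != held_out)
loader = DataLoader(source, batch_size=64, shuffle=True)

model = LateFusionERM(dims=(32, 48), num_classes=5)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for *modalities, y in loader:  # last tensor is the label batch
        opt.zero_grad()
        loss_fn(model(modalities), y).backward()
        opt.step()

# Cross-domain generalization: accuracy on the domain never seen in training.
model.eval()
xa, xb, y = domains[held_out].tensors
with torch.no_grad():
    acc = (model([xa, xb]).argmax(-1) == y).float().mean()
print(f"Accuracy on unseen domain {held_out}: {acc:.3f}")
```

The specialized MMDG methods the benchmark compares would replace only the training loop (e.g., with an alignment or distillation objective) while keeping the same data splits and evaluation, which is exactly the controlled comparison that the abstract argues has been missing.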