Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

Abstract

Despite the growing popularity of Multimodal Domain Generalization (MMDG) for enhancing model robustness, it remains unclear whether reported performance gains reflect genuine algorithmic progress or are artifacts of inconsistent evaluation protocols. Current research is fragmented, with studies varying significantly across datasets, modality configurations, and experimental settings. Furthermore, existing benchmarks focus predominantly on action recognition, often neglecting critical real-world challenges such as input corruptions, missing modalities, and model trustworthiness. This lack of standardization hinders a reliable assessment of the field's progress. To address this issue, we introduce MMDG-Bench, the first unified and comprehensive benchmark for MMDG, which standardizes evaluation across six datasets spanning three diverse tasks: action recognition, mechanical fault diagnosis, and sentiment analysis. MMDG-Bench encompasses six modality combinations, nine representative methods, and multiple evaluation settings. Beyond standard accuracy, it systematically assesses corruption robustness, missing-modality generalization, misclassification detection, and out-of-distribution detection. With 7,402 neural networks trained in total across 95 unique cross-domain tasks, MMDG-Bench yields five key findings: (1) under fair comparisons, recent specialized MMDG methods offer only marginal improvements over the ERM baseline; (2) no single method consistently outperforms the others across datasets or modality combinations; (3) a substantial gap to upper-bound performance persists, indicating that MMDG remains far from solved; (4) trimodal fusion does not consistently outperform the strongest bimodal configurations; and (5) all evaluated methods exhibit significant degradation under corruption and missing-modality scenarios, with some methods further compromising model trustworthiness.
