Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval

Abstract

Multimodal information retrieval (MIR) faces inherent challenges due to the heterogeneity of data sources and the complexity of cross-modal alignment. While previous studies have identified modality gaps in feature spaces, a systematic approach to addressing these challenges remains unexplored. In this work, we introduce UNITE, a universal framework that tackles these challenges through two critical yet underexplored aspects: data curation and modality-aware training configurations. Our work provides the first comprehensive analysis of how modality-specific data properties influence downstream task performance across diverse scenarios. Moreover, we propose Modal-Aware Masked Contrastive Learning (MAMCL) to mitigate the competitive relationships among instances of different modalities. Our framework achieves state-of-the-art results on multiple multimodal retrieval benchmarks, outperforming existing methods by notable margins. Through extensive experiments, we demonstrate that strategic modality curation and tailored training protocols are pivotal for robust cross-modal representation learning. This work not only advances MIR performance but also provides a foundational blueprint for future research in multimodal systems. Our project is available at https://friedrichor.github.io/projects/UNITE.
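The abstract names MAMCL but does not spell out its loss, so the following is only a rough sketch of what a modality-masked contrastive objective could look like, not the UNITE implementation. It assumes an InfoNCE-style loss in which candidates whose modality differs from that of the query's positive are dropped from the softmax denominator, so that instances of different modalities do not compete as negatives. The function name `masked_info_nce`, its arguments, the masking rule, and the temperature value are all hypothetical and chosen for illustration.

```python
import math


def masked_info_nce(sim, positive_idx, cand_modality, temperature=0.05):
    """Illustrative modality-masked InfoNCE loss (not the UNITE reference code).

    sim[i][j]:        similarity between query i and candidate j
    positive_idx[i]:  index of the positive candidate for query i
    cand_modality[j]: modality label of candidate j, e.g. "text" or "video"
    """
    losses = []
    for i, pos in enumerate(positive_idx):
        target = cand_modality[pos]
        # Assumed masking rule: keep only candidates that share the
        # positive's modality; all other candidates are ignored.
        kept = [sim[i][j] / temperature
                for j, m in enumerate(cand_modality) if m == target]
        # Numerically stable log-sum-exp over the surviving candidates.
        top = max(kept)
        log_norm = top + math.log(sum(math.exp(x - top) for x in kept))
        # Negative log-probability of the positive among the kept candidates.
        losses.append(log_norm - sim[i][pos] / temperature)
    return sum(losses) / len(losses)


# Toy usage: two queries, three candidates (two text, one video).
sim = [[0.9, 0.2, 0.8],
       [0.1, 0.7, 0.3]]
modality = ["text", "text", "video"]
loss = masked_info_nce(sim, positive_idx=[0, 1], cand_modality=modality)
```

In this toy setup the video candidate is excluded for both queries because their positives are text. Under the stated assumption, masking can only shrink the denominator, so the masked loss is never larger than the unmasked one.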
