MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Queries

Abstract

Semantic retrieval is crucial for modern applications yet remains underexplored in current research. Existing datasets are limited to a single language, a single image, or a single retrieval condition, and often fail to exploit the expressive capacity of visual information, as evidenced by the fact that performance is maintained when images are replaced with captions. Practical retrieval scenarios, however, frequently involve interleaved multi-condition queries with multiple images. This paper therefore introduces MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval, comprising 320,000 queries and 135,000 products in 5 languages and covering 7 distinct product categories. Extensive experiments on MERIT reveal a limitation of existing models: they focus solely on global semantic information while neglecting the specific conditional elements in queries. We therefore propose Coral, a novel fine-tuning framework that adapts pre-trained multimodal large language models (MLLMs) by integrating embedding reconstruction, to preserve fine-grained conditional elements, with contrastive learning, to extract comprehensive global semantics. Experiments demonstrate that Coral achieves a 45.9% performance improvement over conventional approaches on MERIT, with strong generalization validated across 8 established retrieval benchmarks. Collectively, our contributions (a novel dataset, the identification of critical limitations in existing approaches, and an innovative fine-tuning framework) establish a foundation for future research in interleaved multi-condition semantic retrieval.
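
The abstract describes Coral as pairing contrastive learning (for global semantics) with embedding reconstruction (for fine-grained conditional elements), but it does not state the loss functions. The following is therefore only a minimal sketch of how a two-term objective of this kind is commonly set up, assuming an in-batch InfoNCE contrastive term and a mean-squared-error reconstruction term. All function names, the `recon_weight` parameter, and the temperature value are hypothetical and are not taken from the paper.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(queries, targets, temperature=0.07):
    """Mean InfoNCE loss over a batch (assumed contrastive term).

    queries[i] is paired with targets[i]; every other target in the
    batch serves as an in-batch negative for query i.
    """
    q = [normalize(v) for v in queries]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, tj) / temperature for tj in t]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def reconstruction_mse(original, reconstructed):
    """Mean squared error between original and reconstructed embeddings
    (assumed reconstruction term)."""
    total, count = 0.0, 0
    for o, r in zip(original, reconstructed):
        for x, y in zip(o, r):
            total += (x - y) ** 2
            count += 1
    return total / count


def combined_loss(queries, targets, original, reconstructed,
                  recon_weight=1.0, temperature=0.07):
    """Contrastive term plus a weighted reconstruction term.

    recon_weight is a hypothetical hyperparameter for this sketch.
    """
    contrastive = info_nce(queries, targets, temperature)
    reconstruction = reconstruction_mse(original, reconstructed)
    return contrastive + recon_weight * reconstruction
```

In this sketch the contrastive term pulls each query embedding toward its matching product and away from the other products in the batch, while the reconstruction term penalizes an embedding that can no longer reproduce its inputs. How Coral actually defines, masks, or weights these terms is not given in the abstract.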