CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark
October 30, 2025
作者: Jiaqi Wang, Xiao Yang, Kai Sun, Parth Suresh, Sanat Sharma, Adam Czyzewski, Derek Andersen, Surya Appini, Arkav Banerjee, Sajal Choudhary, Shervin Ghasemlou, Ziqiang Guan, Akil Iyer, Haidar Khan, Lingkun Kong, Roy Luo, Tiffany Ma, Zhen Qiao, David Tran, Wenfang Xu, Skyler Yeatman, Chen Zhou, Gunveer Gujral, Yinglong Xia, Shane Moon, Nicolas Scheffer, Nirav Shah, Eun Chang, Yue Liu, Florian Metze, Tammy Stark, Zhaleh Feizollahi, Andrea Jessee, Mangesh Pujari, Ahmed Aly, Babak Damavandi, Rakesh Wanga, Anuj Kumar, Rohit Patel, Wen-tau Yih, Xin Luna Dong
cs.AI
Abstract
Wearable devices such as smart glasses are transforming the way people
interact with their surroundings, enabling users to seek information regarding
entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG)
plays a key role in answering such questions, yet there is still no comprehensive benchmark for this task, especially for wearable
scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG
benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse
set of 6.5K (image, question, answer) triplets and 2K visually grounded multi-turn
conversations across 13 domains, including 6.2K egocentric images designed to
mimic captures from wearable devices. We carefully construct the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and varying numbers of conversation turns. We design three tasks:
single-source augmentation, multi-source augmentation, and multi-turn
conversations -- each paired with an associated retrieval corpus and APIs for
both image-KG retrieval and webpage retrieval. Our evaluation shows that
straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM
single- and multi-turn QA, respectively, whereas state-of-the-art industry
solutions have similar quality (32%/45%), underscoring ample room for
improvement. The benchmark served as the platform for KDD Cup 2025, attracting about 1K participants and 5K submissions; the winning solutions improved baseline performance by 28%, highlighting its early impact on advancing the field.
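To make the task setup concrete, below is a minimal sketch of the kind of "straightforward RAG" baseline evaluated above: each turn retrieves evidence from both sources and prompts a vision-language model with the image, conversation history, and retrieved text. The interfaces are assumptions for illustration only: search_image_kg, search_web, and vlm_generate are hypothetical placeholders standing in for the benchmark's image-KG retrieval API, its webpage retrieval API, and an underlying model; they are not the published CRAG-MM API.

    from dataclasses import dataclass, field

    @dataclass
    class Turn:
        question: str
        answer: str = ""

    @dataclass
    class Conversation:
        image_path: str                      # egocentric image from the wearable device
        turns: list[Turn] = field(default_factory=list)

    def search_image_kg(image_path: str, k: int = 5) -> list[str]:
        # Hypothetical image-KG retrieval: match the image against an entity
        # knowledge graph and return textual facts about recognized entities.
        return []

    def search_web(query: str, k: int = 5) -> list[str]:
        # Hypothetical webpage retrieval over the task's web corpus.
        return []

    def vlm_generate(prompt: str, image_path: str) -> str:
        # Hypothetical call to a vision-language model.
        return "I don't know."

    def answer_turn(conv: Conversation, question: str) -> str:
        # One RAG step: retrieve from both sources, then prompt the model with
        # the image, the conversation history, and the retrieved evidence.
        evidence = search_image_kg(conv.image_path) + search_web(question)
        history = "\n".join(f"Q: {t.question}\nA: {t.answer}" for t in conv.turns)
        prompt = (
            "Answer truthfully from the image and evidence; "
            "say 'I don't know' if unsure.\n"
            f"History:\n{history}\n"
            "Evidence:\n" + "\n".join(evidence) + "\n"
            f"Question: {question}"
        )
        answer = vlm_generate(prompt, conv.image_path)
        conv.turns.append(Turn(question, answer))
        return answer

A multi-turn episode then simply calls answer_turn once per user question, accumulating history in conv.turns; the single-source augmentation task corresponds to dropping one of the two retrieval calls.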