CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark
October 30, 2025
作者: Jiaqi Wang, Xiao Yang, Kai Sun, Parth Suresh, Sanat Sharma, Adam Czyzewski, Derek Andersen, Surya Appini, Arkav Banerjee, Sajal Choudhary, Shervin Ghasemlou, Ziqiang Guan, Akil Iyer, Haidar Khan, Lingkun Kong, Roy Luo, Tiffany Ma, Zhen Qiao, David Tran, Wenfang Xu, Skyler Yeatman, Chen Zhou, Gunveer Gujral, Yinglong Xia, Shane Moon, Nicolas Scheffer, Nirav Shah, Eun Chang, Yue Liu, Florian Metze, Tammy Stark, Zhaleh Feizollahi, Andrea Jessee, Mangesh Pujari, Ahmed Aly, Babak Damavandi, Rakesh Wanga, Anuj Kumar, Rohit Patel, Wen-tau Yih, Xin Luna Dong
cs.AI
Abstract
Wearable devices such as smart glasses are transforming the way people
interact with their surroundings, enabling users to seek information regarding
entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG)
plays a key role in answering such questions, yet there is still no comprehensive benchmark for this task, especially for wearable
scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG
benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse
set of 6.5K (image, question, answer) triplets and 2K visually grounded multi-turn
conversations across 13 domains, including 6.2K egocentric images designed to
mimic captures from wearable devices. We carefully construct the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and varying numbers of conversation turns. We design three tasks:
single-source augmentation, multi-source augmentation, and multi-turn
conversations -- each paired with an associated retrieval corpus and APIs for
both image-KG retrieval and webpage retrieval. Our evaluation shows that
straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM
single- and multi-turn QA, respectively, whereas state-of-the-art industry
solutions have similar quality (32%/45%), underscoring ample room for
improvement. The benchmark served as the platform for KDD Cup 2025, attracting about 1K participants and 5K submissions; the winning solutions improved baseline performance by 28%, highlighting its early impact on advancing the field.
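To make the task setup concrete, below is a minimal sketch of the kind of "straightforward RAG" baseline evaluated above: each turn retrieves evidence from both sources and prompts a vision-language model with the image, conversation history, and retrieved text. The interfaces are assumptions for illustration only: search_image_kg, search_web, and vlm_generate are hypothetical placeholders standing in for the benchmark's image-KG retrieval API, its webpage retrieval API, and an underlying model; they are not the published CRAG-MM API.

    from dataclasses import dataclass, field

    @dataclass
    class Turn:
        question: str
        answer: str = ""

    @dataclass
    class Conversation:
        image_path: str                      # egocentric image from the wearable device
        turns: list[Turn] = field(default_factory=list)

    def search_image_kg(image_path: str, k: int = 5) -> list[str]:
        # Hypothetical image-KG retrieval: match the image against an entity
        # knowledge graph and return textual facts about recognized entities.
        return []

    def search_web(query: str, k: int = 5) -> list[str]:
        # Hypothetical webpage retrieval over the task's web corpus.
        return []

    def vlm_generate(prompt: str, image_path: str) -> str:
        # Hypothetical call to a vision-language model.
        return "I don't know."

    def answer_turn(conv: Conversation, question: str) -> str:
        # One RAG step: retrieve from both sources, then prompt the model with
        # the image, the conversation history, and the retrieved evidence.
        evidence = search_image_kg(conv.image_path) + search_web(question)
        history = "\n".join(f"Q: {t.question}\nA: {t.answer}" for t in conv.turns)
        prompt = (
            "Answer truthfully from the image and evidence; "
            "say 'I don't know' if unsure.\n"
            f"History:\n{history}\n"
            "Evidence:\n" + "\n".join(evidence) + "\n"
            f"Question: {question}"
        )
        answer = vlm_generate(prompt, conv.image_path)
        conv.turns.append(Turn(question, answer))
        return answer

A multi-turn episode then simply calls answer_turn once per user question, accumulating history in conv.turns; the single-source augmentation task corresponds to dropping one of the two retrieval calls.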