

CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

October 30, 2025
Authors: Jiaqi Wang, Xiao Yang, Kai Sun, Parth Suresh, Sanat Sharma, Adam Czyzewski, Derek Andersen, Surya Appini, Arkav Banerjee, Sajal Choudhary, Shervin Ghasemlou, Ziqiang Guan, Akil Iyer, Haidar Khan, Lingkun Kong, Roy Luo, Tiffany Ma, Zhen Qiao, David Tran, Wenfang Xu, Skyler Yeatman, Chen Zhou, Gunveer Gujral, Yinglong Xia, Shane Moon, Nicolas Scheffer, Nirav Shah, Eun Chang, Yue Liu, Florian Metze, Tammy Stark, Zhaleh Feizollahi, Andrea Jessee, Mangesh Pujari, Ahmed Aly, Babak Damavandi, Rakesh Wanga, Anuj Kumar, Rohit Patel, Wen-tau Yih, Xin Luna Dong
cs.AI

Abstract

Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information about entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially for wearable-device scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K vision-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations -- each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, whereas state-of-the-art industry solutions have similar quality (32%/45%), underscoring ample room for improvement. The benchmark hosted KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.
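To make the task setup concrete, the single-source augmentation task can be sketched as a minimal retrieve-then-answer loop: given an image and a question, look up facts linked to the image in a single corpus (here, a toy image-KG) and ground the answer in them, refusing when nothing is retrieved. This is a hypothetical illustration only; the function names, data structures, and toy KG below are assumptions, not the actual CRAG-MM API.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    """One (image, question) pair, as in a CRAG-MM triplet (hypothetical schema)."""
    image_id: str
    question: str

# Toy image-KG: image id -> facts about the entity in view (illustrative data).
IMAGE_KG = {
    "img_001": {"entity": "Eiffel Tower", "height_m": 330},
}

def retrieve_image_kg(image_id: str) -> dict:
    """Single-source retrieval: fetch KG facts linked to the image, if any."""
    return IMAGE_KG.get(image_id, {})

def answer(turn: Turn) -> str:
    """Ground the answer in retrieved facts; refuse rather than hallucinate."""
    facts = retrieve_image_kg(turn.image_id)
    if not facts:
        # Truthfulness-style metrics typically score a refusal better
        # than a fabricated answer.
        return "I don't know"
    if "tall" in turn.question.lower() or "height" in turn.question.lower():
        return f"{facts['entity']} is {facts['height_m']} m tall"
    return facts["entity"]

print(answer(Turn("img_001", "How tall is this?")))  # grounded answer
print(answer(Turn("img_999", "What is this?")))      # no retrieval -> refuse
```

The multi-source and multi-turn tasks extend this loop by merging evidence from several corpora (e.g. image-KG plus webpages) and by carrying conversation state across turns.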