CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

Abstract

Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information about entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially for wearables scenarios. To fill this gap, we present CRAG-MM, a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different numbers of conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations, each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single-turn and multi-turn QA, respectively, and state-of-the-art industry solutions achieve similar quality (32%/45%), underscoring ample room for improvement. The benchmark hosted KDD Cup 2025, attracting about 1K participants and 5K submissions; winning solutions improved baseline performance by 28%, highlighting its early impact on advancing the field.
