MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning

Abstract

Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has enabled them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite this promise, the reasoning abilities of MLLMs across different real-life scenarios remain largely unexplored, and standardized benchmarks for evaluating them are lacking. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate the diverse multimodal multi-image reasoning capabilities of MLLMs across real-life scenarios. MMR-Life consists of 2,646 multiple-choice questions based on 19,108 images, primarily sourced from real-world contexts, and covers seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise; instead, it requires models to integrate information across multiple images and apply diverse reasoning abilities. An evaluation of 37 advanced models highlights the substantial challenge posed by MMR-Life. Even top models like GPT-5 achieve only 58% accuracy and display considerable variance in performance across reasoning types. Moreover, we analyze the reasoning paradigms of existing MLLMs, exploring how factors such as thinking length, reasoning method, and reasoning type affect their performance. In summary, MMR-Life establishes a comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.
