MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models
February 2, 2025
Authors: Huanqia Cai, Yijun Yang, Winston Hu
cs.AI
Abstract
IQ testing has served as a foundational methodology for evaluating human
cognitive capabilities, deliberately decoupling assessment from linguistic
background, language proficiency, or domain-specific knowledge to isolate core
competencies in abstraction and reasoning. Yet, artificial intelligence
research currently lacks systematic benchmarks to quantify these critical
cognitive dimensions in multimodal systems. To address this critical gap, we
propose MM-IQ, a comprehensive evaluation framework comprising 2,710
meticulously curated test items spanning 8 distinct reasoning paradigms.
Through systematic evaluation of leading open-source and proprietary
multimodal models, our benchmark reveals striking limitations: even
state-of-the-art architectures achieve only marginally superior performance to
random chance (27.49% vs. 25% baseline accuracy). This substantial performance
chasm highlights the inadequacy of current multimodal systems in approximating
fundamental human reasoning capacities, underscoring the need for
paradigm-shifting advancements to bridge this cognitive divide.
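To make the reported gap concrete, the following is a minimal sketch of the evaluation arithmetic behind the quoted figures. It assumes four-option multiple-choice items (consistent with the 25% random-chance baseline above); the item schema (`image`, `choices`, `answer`) and the prediction callable are hypothetical placeholders, not MM-IQ's actual data format or evaluation code.

```python
import random

# Toy items in a hypothetical schema; the real benchmark contains 2,710 items
# spanning 8 reasoning paradigms.
items = [
    {"image": "puzzle_0001.png", "choices": ["A", "B", "C", "D"], "answer": "C"},
    {"image": "puzzle_0002.png", "choices": ["A", "B", "C", "D"], "answer": "A"},
]

def accuracy(predict, items):
    """Fraction of items for which the predicted choice matches the answer key."""
    correct = sum(predict(item) == item["answer"] for item in items)
    return correct / len(items)

# Random-guessing baseline: with 4 options per item the expected accuracy is 25%,
# the baseline the abstract compares against.
chance = accuracy(lambda item: random.choice(item["choices"]), items)
print(f"random-chance accuracy on this toy set: {chance:.2%}")
```

A model's score would be obtained the same way by substituting its answer-extraction function for the random-choice lambda; the reported 27.49% corresponds to roughly 745 of the 2,710 items answered correctly.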