AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process
February 2, 2026
Authors: Xintong Zhang, Xiaowen Zhang, Jongrong Wu, Zhi Gao, Shilin Yan, Zhenxin Diao, Kunpeng Gao, Xuanyan Chen, Yuwei Wu, Yunde Jia, Qing Li
cs.AI
Abstract
Adaptive multimodal reasoning has emerged as a promising frontier for Vision-Language Models (VLMs), aiming to dynamically switch between tool-augmented visual reasoning and text-only reasoning to improve both effectiveness and efficiency. However, existing evaluations rely on static difficulty labels and simplistic metrics, which fail to capture how task difficulty shifts with model capability. Consequently, they obscure the distinction between adaptive mode selection and general performance while neglecting fine-grained process analysis. In this paper, we propose AdaptMMBench, a comprehensive benchmark for adaptive multimodal reasoning across five domains: real-world scenes, OCR, GUI, knowledge, and math, encompassing both direct perception and complex reasoning tasks. AdaptMMBench uses the Matthews Correlation Coefficient (MCC) to evaluate how rationally models select among reasoning modes, isolating this meta-cognitive ability by dynamically identifying task difficulty relative to each model's capability boundary. Moreover, AdaptMMBench facilitates multi-dimensional process evaluation across key-step coverage, tool effectiveness, and computational efficiency. Our evaluation reveals that while adaptive mode selection scales with model capacity, it is notably decoupled from final accuracy. Conversely, key-step coverage aligns with performance, though tool effectiveness remains highly inconsistent across model architectures.
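To make the MCC-based selection metric concrete, here is a minimal sketch of how mode-selection rationality could be scored. It treats "invoke tool-augmented visual reasoning" as the positive class, with a per-sample ground-truth label saying whether the task lies beyond the model's text-only capability boundary. The data and labeling scheme below are hypothetical illustrations, not the benchmark's actual protocol:

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient for binary decisions.

    Returns a value in [-1, 1]; 0 when the denominator vanishes
    (the conventional fallback for degenerate confusion matrices).
    """
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Hypothetical per-sample data:
# decisions[i] = 1 if the model chose tool-augmented reasoning on sample i
# labels[i]    = 1 if the sample exceeds the model's text-only capability
decisions = [1, 1, 0, 0, 1, 0, 1, 0]
labels    = [1, 0, 0, 0, 1, 0, 1, 1]

tp = sum(d == 1 and l == 1 for d, l in zip(decisions, labels))
tn = sum(d == 0 and l == 0 for d, l in zip(decisions, labels))
fp = sum(d == 1 and l == 0 for d, l in zip(decisions, labels))
fn = sum(d == 0 and l == 1 for d, l in zip(decisions, labels))

print(mcc(tp, tn, fp, fn))  # 0.5 for this toy confusion matrix
```

Unlike raw accuracy, MCC stays informative when the two modes are imbalanced (e.g. a model that always picks text-only reasoning on an easy-heavy split scores near 0, not near its accuracy), which is presumably why it suits isolating selection rationality from overall performance.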