AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process

February 2, 2026
Authors: Xintong Zhang, Xiaowen Zhang, Jongrong Wu, Zhi Gao, Shilin Yan, Zhenxin Diao, Kunpeng Gao, Xuanyan Chen, Yuwei Wu, Yunde Jia, Qing Li
cs.AI

Abstract

Adaptive multimodal reasoning has emerged as a promising frontier in Vision-Language Models (VLMs), aiming to dynamically modulate between tool-augmented visual reasoning and text reasoning to enhance both effectiveness and efficiency. However, existing evaluations rely on static difficulty labels and simplistic metrics, which fail to capture the dynamic nature of difficulty relative to varying model capacities. Consequently, they obscure the distinction between adaptive mode selection and general performance while neglecting fine-grained process analyses. In this paper, we propose AdaptMMBench, a comprehensive benchmark for adaptive multimodal reasoning across five domains: real-world, OCR, GUI, knowledge, and math, encompassing both direct perception and complex reasoning tasks. AdaptMMBench utilizes a Matthews Correlation Coefficient (MCC) metric to evaluate the selection rationality of different reasoning modes, isolating this meta-cognition ability by dynamically identifying task difficulties based on models' capability boundaries. Moreover, AdaptMMBench facilitates multi-dimensional process evaluation across key step coverage, tool effectiveness, and computational efficiency. Our evaluation reveals that while adaptive mode selection scales with model capacity, it notably decouples from final accuracy. Conversely, key step coverage aligns with performance, though tool effectiveness remains highly inconsistent across model architectures.
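
To make the MCC-based mode-selection score concrete, here is a minimal sketch in Python. The abstract does not spell out how labels are assigned, so the labeling rule below is an assumption for illustration only: a task counts as "needing tools" for a given model when that model answers it incorrectly in text-only mode, and the prediction is whether the model actually chose tool-augmented reasoning. The function names `matthews_corrcoef` and `mode_selection_mcc` are hypothetical and are not taken from the benchmark's code.

```python
import math
from typing import Sequence


def matthews_corrcoef(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Matthews Correlation Coefficient for binary labels.

    Returns a value in [-1, 1]; 0.0 is returned when the denominator is zero
    (for example, when every prediction falls in a single class).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def mode_selection_mcc(
    text_only_correct: Sequence[bool], chose_tool_mode: Sequence[bool]
) -> float:
    """Score how well a model's mode choices match its own capability boundary.

    ASSUMPTION (not stated in the abstract): a task "needs tools" for this
    model exactly when the model fails it in text-only mode.
    """
    needs_tool = [not ok for ok in text_only_correct]
    return matthews_corrcoef(needs_tool, chose_tool_mode)


# Hypothetical per-task records for one model.
text_only_correct = [True, True, False, False]
perfect_choices = [False, False, True, True]   # tools only where text fails
inverted_choices = [True, True, False, False]  # tools only where text succeeds
always_tool = [True, True, True, True]         # no adaptivity at all

print(mode_selection_mcc(text_only_correct, perfect_choices))
print(mode_selection_mcc(text_only_correct, inverted_choices))
print(mode_selection_mcc(text_only_correct, always_tool))
```

Because the reference labels are derived from each model's own text-only results, the same task can be "easy" for one model and "hard" for another. Under this assumed setup, that is how the score can be separated from final accuracy: a model that always invokes tools may answer many tasks correctly yet still score 0 on mode selection.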