V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions
December 12, 2025
Authors: Chenrui Fan, Yijun Liang, Shweta Bhardwaj, Kwesi Cobbina, Ming Li, Tianyi Zhou
cs.AI
Abstract
While many vision-language models (VLMs) are developed to answer well-defined, straightforward questions with highly specified targets, as in most benchmarks, they often struggle in practice with complex open-ended tasks, which usually require multiple rounds of exploration and reasoning in the visual space. Such visual thinking paths not only provide step-by-step exploration and verification, as an AI detective would, but also yield better interpretations of the final answers. However, these paths are challenging to evaluate due to the large exploration space of intermediate steps. To bridge this gap, we develop an evaluation suite, "Visual Reasoning with multi-step EXploration (V-REX)", which comprises a benchmark of challenging visual reasoning tasks requiring native multi-step exploration and an accompanying evaluation protocol. V-REX covers rich application scenarios across diverse domains. It casts multi-step exploratory reasoning as a Chain-of-Questions (CoQ) and disentangles VLMs' capabilities into (1) Planning: breaking down an open-ended task by selecting a chain of exploratory questions; and (2) Following: answering a curated CoQ sequentially to collect the information needed to derive the final answer. By curating a finite set of question and answer options per step, V-REX enables reliable, quantitative, and fine-grained analysis of the intermediate steps. By assessing state-of-the-art proprietary and open-source VLMs, we reveal consistent scaling trends, significant gaps between planning and following abilities, and substantial room for improvement in multi-step exploratory reasoning.
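To make the Chain-of-Questions setup concrete, the sketch below shows one way a V-REX-style item with finite options per step could be represented and how a model's "Following" accuracy over intermediate steps might be scored. The names (CoQStep, CoQItem, following_accuracy) and the toy data are illustrative assumptions, not the authors' released benchmark code.

```python
# Minimal sketch of a Chain-of-Questions (CoQ) item and a "Following" score.
# All identifiers here are hypothetical; the actual V-REX format may differ.
from dataclasses import dataclass


@dataclass
class CoQStep:
    question: str        # exploratory question posed at this step
    options: list[str]   # finite answer options, enabling reliable scoring
    answer: int          # index of the correct option


@dataclass
class CoQItem:
    task: str                # the open-ended visual task
    steps: list[CoQStep]     # curated chain of exploratory questions
    final_options: list[str]
    final_answer: int


def following_accuracy(item: CoQItem, predictions: list[int]) -> float:
    """Fraction of intermediate steps answered correctly when the model
    is given the curated chain (the 'Following' setting)."""
    correct = sum(p == s.answer for p, s in zip(predictions, item.steps))
    return correct / len(item.steps)


# Toy usage: two intermediate steps, one answered correctly -> 0.5
item = CoQItem(
    task="Which suspect left the room last?",
    steps=[
        CoQStep("Is the window open?", ["yes", "no"], 0),
        CoQStep("Whose coat is on the chair?", ["A", "B", "C"], 2),
    ],
    final_options=["A", "B", "C"],
    final_answer=2,
)
print(following_accuracy(item, predictions=[0, 1]))  # 0.5
```

Because every step has a finite option set, a per-step score like this one is well defined, which is what allows the fine-grained analysis of intermediate reasoning that the abstract describes; the "Planning" setting would instead score the model's selection of questions rather than its answers.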