MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning
March 12, 2026
Authors: Haozhan Shen, Shilin Yan, Hongwei Xue, Shuaiqi Lu, Xiaojun Tang, Guannan Zhang, Tiancheng Zhao, Jianwei Yin
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow compositions or independent constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To scalably construct such workflow-style data, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains only 53.33% Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.
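To make the chained-conditional setup concrete, here is a minimal sketch of how a multi-layer reasoning chain with mechanically verifiable conditions might be represented and executed. All names (`Layer`, `run_chain`, the fact keys) are illustrative assumptions, not the paper's actual VPIR schema; the point is only that each layer is a compositional predicate over grounded visual facts, and evaluating the chain yields an execution path that may terminate early.

```python
# Hypothetical illustration of a VPIR-style conditional chain.
# Names and schema are invented for this sketch, not taken from the paper.
from dataclasses import dataclass
from typing import Callable

Facts = dict  # grounded visual facts, e.g. {"dialog_visible": True, "ui_color": "green"}

@dataclass
class Layer:
    condition: Callable[[Facts], bool]  # mechanically verifiable compositional predicate
    if_true: str                        # step taken when the condition holds
    if_false: str                       # step taken otherwise ("STOP" terminates early)

def run_chain(layers: list[Layer], facts: Facts) -> list[str]:
    """Follow the chain layer by layer, recording the execution path."""
    path = []
    for layer in layers:
        step = layer.if_true if layer.condition(facts) else layer.if_false
        path.append(step)
        if step == "STOP":  # early termination, as in branching workflows
            break
    return path

# Example mirroring the abstract's condition: "if a permission dialog
# appears AND the interface color is green, click Allow".
chain = [
    Layer(lambda f: f["dialog_visible"] and f["ui_color"] == "green",
          if_true="click_allow", if_false="STOP"),
    Layer(lambda f: f["page_loaded"], if_true="done", if_false="retry"),
]
print(run_chain(chain, {"dialog_visible": True, "ui_color": "green",
                        "page_loaded": True}))
# → ['click_allow', 'done']
```

A predicted path like this could then be scored against the ground-truth path with a step-level F1 (the abstract's Path F1 metric); a hard negative would flip one fact (e.g. `ui_color = "red"`) so the correct path terminates at the first layer.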