MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning
March 12, 2026
Authors: Haozhan Shen, Shilin Yan, Hongwei Xue, Shuaiqi Lu, Xiaojun Tang, Guannan Zhang, Tiancheng Zhao, Jianwei Yin
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow compositions or independent constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To scalably construct such workflow-style data, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures that each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains a Path F1 of only 53.33%, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.
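The abstract does not give the VPIR format, but the idea of a mechanically verifiable, chained compositional condition can be sketched as follows. This is an illustrative mock-up, not the paper's actual representation: all names (`Scene`, `Layer`, `run_chain`) and the fact schema are hypothetical assumptions.

```python
# Hypothetical sketch of a VPIR-style chain: each layer is a compositional
# condition (a conjunction of predicates over structured visual facts) that
# can be checked mechanically; layers chain into an execution path that may
# branch or terminate early. Not the paper's actual implementation.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Scene:
    # Structured visual facts for one step (e.g., one GUI screenshot).
    objects: set
    attributes: dict = field(default_factory=dict)

@dataclass
class Layer:
    # The layer's condition holds only if ALL predicates hold on the scene,
    # making it compositional (multiple objects/attributes/relations).
    predicates: list  # list of Callable[[Scene], bool]
    on_true: str      # step taken if the condition holds
    on_false: str     # branch taken otherwise; "STOP" terminates the chain

    def check(self, scene: Scene) -> bool:
        return all(p(scene) for p in self.predicates)

def run_chain(layers: list, scenes: list) -> list:
    """Follow the execution path layer by layer, allowing early termination."""
    path = []
    for layer, scene in zip(layers, scenes):
        step = layer.on_true if layer.check(scene) else layer.on_false
        path.append(step)
        if step == "STOP":
            break
    return path

# The abstract's example: "if a permission dialog appears and the color of
# the interface is green, click Allow".
layer1 = Layer(
    predicates=[
        lambda s: "permission_dialog" in s.objects,
        lambda s: s.attributes.get("interface_color") == "green",
    ],
    on_true="click_Allow",
    on_false="STOP",
)
scene1 = Scene(objects={"permission_dialog", "button"},
               attributes={"interface_color": "green"})
print(run_chain([layer1], [scene1]))  # ['click_Allow']
```

Because each layer is an executable predicate rather than free text, a ground-truth path can be recomputed and compared against a model's predicted path, which is presumably what enables the Path F1 metric reported above.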