MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning

Abstract

Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow compositions or independent constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To scalably construct such workflow-style data, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, and performance drops sharply on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.
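To make the notion of a multi-layer conditional chain concrete, the following is a minimal illustrative sketch in Python. The abstract does not specify the VPIR format or the benchmark's data schema, so everything here is an assumption made for illustration: the `Layer` class, the `run_chain` function, the fact keys (`dialog_visible`, `interface_color`, `allow_button_enabled`), and the outcome strings are all hypothetical. The sketch only shows the general idea of evaluating compositional conditions layer by layer against visual facts and terminating early when a condition fails.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for visual evidence extracted from an image or GUI state.
Facts = Dict[str, object]


@dataclass
class Layer:
    """One layer of the chain (illustrative, not the paper's VPIR format)."""
    name: str
    condition: Callable[[Facts], bool]  # compositional predicate over the facts
    on_false: str                       # outcome when the condition fails


def run_chain(
    layers: List[Layer], facts: Facts, final_outcome: str
) -> Tuple[List[Tuple[str, bool]], str]:
    """Evaluate layers in order and stop at the first failing condition."""
    path: List[Tuple[str, bool]] = []
    for layer in layers:
        holds = bool(layer.condition(facts))
        path.append((layer.name, holds))
        if not holds:
            # Early termination: the chain branches out at this layer.
            return path, layer.on_false
    # Every condition held, so the chain reaches its final outcome.
    return path, final_outcome


# Assumed facts, loosely modeled on the permission-dialog example in the abstract.
facts: Facts = {
    "dialog_visible": True,
    "interface_color": "green",
    "allow_button_enabled": False,
}

layers = [
    Layer(
        "L1",
        lambda f: f["dialog_visible"] and f["interface_color"] == "green",
        "do nothing",
    ),
    Layer("L2", lambda f: f["allow_button_enabled"], "report blocked"),
]

path, outcome = run_chain(layers, facts, final_outcome="click Allow")
print(path, outcome)
```

In this sketch the first layer holds and the second fails, so the chain stops at the second layer rather than reaching the final action. Because each condition is an executable predicate, the expected path can be checked mechanically, which is the property the abstract attributes to VPIR.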
