VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

Abstract

Recent advances have extended Multimodal Large Language Models (MLLMs) beyond standard visual question answering to using external tools for advanced visual tasks. Despite this progress, precisely executing and effectively composing diverse tools for complex tasks remains a persistent bottleneck. Constrained by sparse tool-sets and simple tool-use trajectories, existing benchmarks fail to capture complex and diverse tool interactions, and so fall short in evaluating model performance under practical, real-world conditions. To bridge this gap, we introduce VisualToolChain-Bench (VTC-Bench), a comprehensive benchmark designed to evaluate tool-use proficiency in MLLMs. To align with realistic computer vision pipelines, our framework features 32 diverse OpenCV-based visual operations. This rich tool-set enables extensive combinations, allowing VTC-Bench to rigorously assess multi-tool composition and long-horizon, multi-step plan execution. For precise evaluation, we provide 680 curated problems structured across a nine-category cognitive hierarchy, each with a ground-truth execution trajectory. Extensive experiments on 19 leading MLLMs reveal critical limitations in current models' visual agentic capabilities. Specifically, models struggle to adapt to diverse tool-sets and to generalize to unseen operations: even the leading model, Gemini-3.0-Pro, achieves only 51% accuracy on our benchmark. Furthermore, multi-tool composition remains a persistent challenge. When facing complex tasks, models struggle to formulate efficient execution plans, relying heavily on a narrow, suboptimal subset of familiar functions rather than selecting the optimal tools. By identifying these fundamental challenges, VTC-Bench establishes a rigorous baseline to guide the development of more generalized visual agentic models.
