
TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning

November 3, 2025
Authors: Ming Li, Jike Zhong, Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Yuxiang Lai, Wei Chen, Konstantinos Psounis, Kaipeng Zhang
cs.AI

Abstract

The frontier of visual reasoning is shifting toward models like OpenAI o3, which can intelligently create and operate tools to transform images for problem-solving, a capability known as thinking-with-images in chain-of-thought. Yet existing benchmarks fail to fully capture this advanced capability. Even Visual Search, the most common benchmark for current thinking-with-images methods, tests only basic operations such as localization and cropping, offering little insight into more complex, dynamic, and tool-dependent reasoning. We introduce TIR-Bench, a comprehensive benchmark for evaluating agentic thinking-with-images across 13 diverse tasks, each requiring novel tool use for image processing and manipulation in chain-of-thought. We evaluate 22 multimodal large language models (MLLMs), ranging from leading open-source and proprietary models to those with explicit tool-use augmentation. Results show that TIR-Bench is universally challenging, and that strong performance requires genuine thinking-with-images capabilities. Finally, we present a pilot study comparing direct versus agentic fine-tuning.