TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning
November 3, 2025
Authors: Ming Li, Jike Zhong, Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Yuxiang Lai, Wei Chen, Konstantinos Psounis, Kaipeng Zhang
cs.AI
Abstract
The frontier of visual reasoning is shifting toward models like OpenAI o3,
which can intelligently create and operate tools to transform images for
problem-solving, a capability known as thinking-with-images in
chain-of-thought. Yet existing benchmarks fail to fully capture this advanced
capability. Even Visual Search, the most common benchmark for current
thinking-with-images methods, tests only basic operations such as
localization and cropping, offering little insight into more complex, dynamic,
and tool-dependent reasoning. We introduce TIR-Bench, a comprehensive
benchmark for evaluating agentic thinking-with-images across 13 diverse tasks,
each requiring novel tool use for image processing and manipulation in
chain-of-thought. We evaluate 22 multimodal large language models (MLLMs), from
leading open-source and proprietary models to those with explicit tool-use
augmentation. Results show that TIR-Bench is universally challenging, and
strong performance requires genuine thinking-with-images capabilities. Finally,
we present a pilot study comparing direct versus agentic fine-tuning.