PyVision: Agentic Vision with Dynamic Tooling
July 10, 2025
Authors: Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, Chen Wei
cs.AI
Abstract
LLMs are increasingly deployed as agents, systems capable of planning,
reasoning, and dynamically calling external tools. However, in visual
reasoning, prior approaches largely remain limited by predefined workflows and
static toolsets. In this report, we present PyVision, an interactive,
multi-turn framework that enables MLLMs to autonomously generate, execute, and
refine Python-based tools tailored to the task at hand, unlocking flexible and
interpretable problem-solving. We develop a taxonomy of the tools created by
PyVision and analyze their usage across a diverse set of benchmarks.
Quantitatively, PyVision achieves consistent performance gains, boosting
GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini.
These results point to a broader shift: dynamic tooling allows models not just
to use tools, but to invent them, advancing toward more agentic visual
reasoning.
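
The generate, execute, and refine cycle described in the abstract can be pictured as a small loop. The sketch below is illustrative only and is not the authors' implementation: `agent_loop`, `run_snippet`, and `scripted_model` are hypothetical names, the model is a scripted stand-in rather than a real MLLM, and the shared namespace that persists across turns is an assumption. It also runs code with a bare `exec`, which a real system would need to sandbox.

```python
import contextlib
import io
import traceback


def run_snippet(code, namespace):
    """Run generated code in a shared namespace; return captured stdout or the traceback."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # unsandboxed: acceptable for a sketch only
    except Exception:
        # Returning the traceback lets the model see the error and revise its code.
        return buffer.getvalue() + traceback.format_exc()
    return buffer.getvalue()


def agent_loop(model, question, max_turns=5):
    """Alternate between model turns and code execution until the model answers."""
    namespace = {}  # assumption: variables persist across turns
    history = [("user", question)]
    for _ in range(max_turns):
        # Hypothetical interface: the model returns {"code": str} or {"answer": str}.
        reply = model(history)
        if "answer" in reply:
            return reply["answer"], history
        output = run_snippet(reply["code"], namespace)
        history.append(("assistant", reply["code"]))
        history.append(("tool", output))
    return None, history


def scripted_model(history):
    """Stand-in for an MLLM: writes a tool first, then answers from its output."""
    tool_outputs = [text for role, text in history if role == "tool"]
    if not tool_outputs:
        return {"code": "grid = [[0, 1], [1, 1]]\nprint(sum(map(sum, grid)))"}
    return {"answer": tool_outputs[-1].strip()}


answer, history = agent_loop(scripted_model, "How many cells are filled?")
```

Here the stand-in model first emits a short counting tool, the loop executes it and feeds the printed result back, and the model then answers from that result. Replacing `scripted_model` with a call to an actual multimodal model, and passing images through the history, is left open because the abstract does not specify those details.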