PyVision: Agentic Vision with Dynamic Tooling

July 10, 2025
Authors: Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, Chen Wei
cs.AI

Abstract

LLMs are increasingly deployed as agents, systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches largely remain limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools, but to invent them, advancing toward more agentic visual reasoning.
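
The generate, execute, and refine cycle described above can be pictured with a minimal sketch. This is not PyVision's implementation: `run_tool`, `agent_loop`, and `scripted_model` are hypothetical names, the "model" is a hard-coded stand-in for an MLLM, and no images are involved. It only illustrates how a multi-turn loop might feed execution output (including errors) back to the model so that it can repair its own tool.

```python
import contextlib
import io
import traceback
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types for this sketch: a history of (role, text) turns, and a
# model that returns either code to run or a final answer.
History = List[Tuple[str, str]]
Model = Callable[[History], Tuple[Optional[str], Optional[str]]]


def run_tool(code: str, namespace: Dict[str, object]) -> str:
    """Run model-written code; return its stdout, plus the traceback on failure."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            # Unsafe for untrusted code; a real system would need isolation.
            exec(code, namespace)
    except Exception:
        return buffer.getvalue() + traceback.format_exc()
    return buffer.getvalue()


def agent_loop(model: Model, question: str, max_turns: int = 5) -> Optional[str]:
    """Alternate between the model and the interpreter until an answer appears."""
    namespace: Dict[str, object] = {}  # variables persist across turns
    history: History = [("user", question)]
    for _ in range(max_turns):
        code, answer = model(history)
        if answer is not None:
            return answer
        if code is None:
            return None
        observation = run_tool(code, namespace)
        history.append(("assistant", code))
        history.append(("tool", observation))
    return None


def scripted_model(history: History) -> Tuple[Optional[str], Optional[str]]:
    """Stand-in for an MLLM: writes a buggy tool, fixes it, then answers."""
    observations = [text for role, text in history if role == "tool"]
    if not observations:
        return "print(sum(row))", None  # first attempt: 'row' is undefined
    if "NameError" in observations[-1]:
        return "row = [3, 4, 5]\nprint(sum(row))", None  # refined tool
    return None, observations[-1].strip()


if __name__ == "__main__":
    print(agent_loop(scripted_model, "What is the sum of the row?"))
```

In this sketch the first tool fails with a `NameError`, the traceback is returned to the model as an observation, and the second attempt succeeds. Keeping one namespace across turns is a design assumption made here so that later tools can reuse earlier results.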