PyVision: Agentic Vision with Dynamic Tooling

Abstract

Large language models (LLMs) are increasingly deployed as agents: systems capable of planning, reasoning, and dynamically calling external tools. In visual reasoning, however, prior approaches remain largely limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables multimodal large language models (MLLMs) to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools, but to invent them, advancing toward more agentic visual reasoning.
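The generate, execute, and refine cycle described in the abstract can be illustrated with a minimal sketch. This is not PyVision's implementation: `run_tool`, `agent_loop`, and `toy_model` are hypothetical names, the model is a hard-coded stand-in rather than a real MLLM, and the persistent namespace and the lack of sandboxing are assumptions made only to keep the example short.

```python
import contextlib
import io
import traceback


def run_tool(code, namespace):
    """Execute model-written code; return its stdout, plus the traceback on failure."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # illustration only: no sandboxing
    except Exception:
        # The error text is returned so the model could refine its tool next turn.
        return buffer.getvalue() + traceback.format_exc()
    return buffer.getvalue()


def agent_loop(model, question, max_turns=5):
    """Alternate between model turns and code execution until an answer appears."""
    history = [("user", question)]
    namespace = {}  # assumption: state persists across turns
    for _ in range(max_turns):
        kind, content = model(history)
        if kind == "answer":
            return content
        history.append(("code", content))
        history.append(("output", run_tool(content, namespace)))
    return None  # no answer within the turn budget


def toy_model(history):
    """Hypothetical stand-in for an MLLM: writes one tool, then answers from its output."""
    if history[-1][0] == "output":
        return "answer", history[-1][1].strip()
    code = (
        "pixels = [[0, 255], [255, 255]]\n"
        "print(sum(v == 255 for row in pixels for v in row))"
    )
    return "code", code


result = agent_loop(toy_model, "How many white pixels are in the image?")
print(result)
```

In a real system, `toy_model` would be replaced by a call to a multimodal model that receives the image and the execution history, and the generated code would run in an isolated environment rather than through a bare `exec`.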
