PyVision: Agentic Vision with Dynamic Tooling

Abstract

Large language models (LLMs) are increasingly deployed as agents: systems capable of planning, reasoning, and dynamically calling external tools. In visual reasoning, however, prior approaches remain largely limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables multimodal large language models (MLLMs) to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools but to invent them, advancing toward more agentic visual reasoning.
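
The generate, execute, refine cycle named in the abstract can be pictured with a small toy loop. This is only an illustrative sketch under stated assumptions, not PyVision's actual implementation: the `<code>` tag protocol and the names `run_snippet`, `agent_loop`, and `scripted_model` are hypothetical, the model is a scripted stand-in rather than a real MLLM, and a nested list stands in for an image. A real system would presumably also sandbox the executed code.

```python
import contextlib
import io
import re
import traceback
from typing import Callable, Dict, List

# Hypothetical protocol: the model wraps any code it wants executed in <code> tags.
CODE_TAG = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_snippet(code: str, namespace: Dict[str, object]) -> str:
    """Run code in a namespace that persists across turns.

    Returns captured stdout, followed by the traceback if the code raised.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # unsandboxed: acceptable only for a toy demo
    except Exception:
        buffer.write(traceback.format_exc())
    return buffer.getvalue()


def agent_loop(model: Callable[[List[str]], str], question: str,
               max_turns: int = 5) -> str:
    """Alternate between model replies and code execution.

    A reply without a <code> block is treated as the final answer.
    """
    history: List[str] = [question]
    namespace: Dict[str, object] = {}
    for _ in range(max_turns):
        reply = model(history)
        history.append(reply)
        match = CODE_TAG.search(reply)
        if match is None:
            return reply
        # Feed the execution result (or error) back for the next turn.
        history.append(run_snippet(match.group(1), namespace))
    return history[-1]


def scripted_model(history: List[str]) -> str:
    """Stand-in for an MLLM: a buggy first attempt, then a fix, then an answer."""
    last = history[-1]
    if len(history) == 1:
        # First attempt calls a tool that has not been defined yet.
        return "<code>print(count_bright(pixels))</code>"
    if "NameError" in last:
        # Refinement: define the missing tool and the toy image, then retry.
        return ("<code>pixels = [[0, 255], [255, 255]]\n"
                "def count_bright(img):\n"
                "    return sum(v == 255 for row in img for v in row)\n"
                "print(count_bright(pixels))</code>")
    return f"Answer: {last.strip()}"


if __name__ == "__main__":
    print(agent_loop(scripted_model, "How many bright pixels are there?"))
```

In this toy run the first snippet fails with a `NameError`, the traceback is returned to the model, and the second snippet defines the missing tool before the final answer is produced. That error-driven retry is meant only to convey what "refine" could look like in a multi-turn setting.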
