
ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use

October 31, 2025
作者: Mengjie Deng, Guanting Dong, Zhicheng Dou
cs.AI

Abstract

Recently, large language models (LLMs) have demonstrated remarkable problem-solving capabilities by autonomously integrating with external tools for collaborative reasoning. However, due to the inherently complex and diverse nature of multimodal information, enabling multimodal large language models (MLLMs) to flexibly and efficiently utilize external tools during reasoning remains an underexplored challenge. In this work, we introduce ToolScope, an agentic framework designed to unify global planning with local multimodal perception, adopting a specialized Perceive tool to mitigate visual context degradation in long-horizon VQA tasks. ToolScope comprises three primary components: the Global Navigator, the Agentic Executor, and the Response Synthesizer. The Global Navigator functions as a "telescope", offering high-level strategic guidance. The Agentic Executor operates iteratively, augmenting the MLLM with local perception through the integration of external tools: Search, Code, and Perceive. Finally, the Response Synthesizer consolidates and organizes the reasoning process into a coherent, user-friendly output. We evaluate ToolScope on four VQA benchmarks across diverse domains: VQA 2.0, ScienceQA, MAT-Search, and MathVista. It demonstrates strong generalization capabilities, achieving an average performance improvement of up to +6.69% across all datasets.
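The three-stage pipeline described above (plan globally, execute tool calls iteratively, then synthesize) can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: all function names, the keyword-based tool dispatch, and the string-valued tool stubs are assumptions, and only the Search and Perceive tools are stubbed here.

```python
# Hypothetical sketch of a ToolScope-style pipeline.
# Names and dispatch logic are illustrative assumptions.

def global_navigator(question):
    """'Telescope': produce high-level strategic guidance as a plan."""
    return [f"identify visual evidence for: {question}",
            f"reason over evidence to answer: {question}"]

def perceive(step):
    """Stand-in for the Perceive tool: re-examine local image detail."""
    return f"visual detail relevant to '{step}'"

def search(step):
    """Stand-in for the Search tool: retrieve external knowledge."""
    return f"retrieved fact for '{step}'"

TOOLS = {"perceive": perceive, "search": search}

def agentic_executor(plan):
    """Iterate over the plan, invoking one external tool per step."""
    trace = []
    for step in plan:
        tool = "perceive" if "visual" in step else "search"
        trace.append((step, TOOLS[tool](step)))
    return trace

def response_synthesizer(trace):
    """Consolidate the reasoning trace into a user-facing answer."""
    return "; ".join(observation for _, observation in trace)

plan = global_navigator("What color is the leftmost car?")
trace = agentic_executor(plan)
print(response_synthesizer(trace))
```

In the actual framework each stage would be driven by the MLLM itself (e.g., the executor emitting structured tool calls each turn); the fixed plan and keyword routing here only make the control flow concrete.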