AdaTooler-V: Adaptive Tool-Use for Images and Videos
December 18, 2025
Authors: Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, Xiangyu Yue
cs.AI
Abstract
Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To address this, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k, a 100k-sample corpus for SFT cold start, and AdaTooler-V-300k, a 300k-sample corpus for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, which outperforms existing methods on diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary models GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.
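The abstract only names AT-GRPO's key idea: scaling rewards by a per-sample Tool Benefit Score so that tool calls are rewarded only where they help. Below is a minimal Python sketch of that idea, not the paper's actual algorithm; the function names, the accuracy-difference definition of the score, and the tanh scaling are all illustrative assumptions.

    import math

    def tool_benefit_score(acc_with_tools, acc_without_tools):
        # Hypothetical per-sample score: how much vision-tool calls
        # improve accuracy on this sample. Positive when tools help.
        return acc_with_tools - acc_without_tools

    def adaptive_reward(base_reward, used_tools, benefit, scale=1.0):
        # Illustrative AT-GRPO-style shaping: up-weight the reward of
        # tool-using rollouts on samples where tools help (benefit > 0)
        # and down-weight them where they do not, discouraging blind
        # tool invocation. tanh bounds the scaling factor in (-1, 1).
        factor = scale * math.tanh(benefit)
        return base_reward * ((1.0 + factor) if used_tools else (1.0 - factor))

    # On a sample where tools add +0.3 accuracy, a correct tool-using
    # rollout earns a larger reward than a correct tool-free one:
    print(adaptive_reward(1.0, used_tools=True, benefit=0.3))   # ~1.29
    print(adaptive_reward(1.0, used_tools=False, benefit=0.3))  # ~0.71

Under this kind of shaping, a sample with near-zero or negative benefit flips the incentive, so the policy learns to skip tool calls where they only add inference overhead.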