Visual Agentic Reinforcement Fine-Tuning
May 20, 2025
作者: Ziyu Liu, Yuhang Zang, Yushan Zou, Zijian Liang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
A key trend in Large Reasoning Models (e.g., OpenAI's o3) is the native
agentic ability to use external tools such as web browsers for searching and
writing/executing code for image manipulation to think with images. In the
open-source research community, while significant progress has been made in
language-only agentic abilities such as function calling and tool integration,
the development of multi-modal agentic capabilities that involve truly thinking
with images, along with corresponding benchmarks, remains underexplored. This
work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning
(Visual-ARFT) for enabling flexible and adaptive reasoning abilities for Large
Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the
ability to browse websites for real-time information updates and write code to
manipulate and analyze input images through cropping, rotation, and other image
processing techniques. We also present a Multi-modal Agentic Tool Bench (MAT)
with two settings (MAT-Search and MAT-Coding) designed to evaluate LVLMs'
agentic search and coding abilities. Our experimental results demonstrate that
Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and
+10.3% F1 / +8.7% EM on MAT-Search, ultimately surpassing GPT-4o. Visual-ARFT
also achieves +29.3% F1 / +25.9% EM gains on existing multi-hop QA benchmarks
such as 2Wiki and HotpotQA, demonstrating strong generalization capabilities.
Our findings suggest that Visual-ARFT offers a promising path toward building
robust and generalizable multimodal agents.
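As an illustration of the kind of image-manipulation code such an agent might write and execute, a minimal pure-Python sketch of cropping and rotating a pixel grid is shown below. This is a hypothetical example, not the paper's actual toolchain; a real agent would more likely emit code calling an imaging library such as Pillow.

```python
# Hypothetical sketch of agent-written image manipulation: crop a region
# of interest from a pixel grid, then rotate it 90 degrees counter-clockwise.

def crop(grid, left, top, right, bottom):
    """Return the sub-grid covering columns [left, right) and rows [top, bottom)."""
    return [row[left:right] for row in grid[top:bottom]]

def rotate90_ccw(grid):
    """Rotate the grid 90 degrees counter-clockwise (transpose, then reverse rows)."""
    return [list(row) for row in zip(*grid)][::-1]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
patch = crop(image, 1, 0, 3, 2)   # [[2, 3], [5, 6]]
print(rotate90_ccw(patch))        # [[3, 6], [2, 5]]
```

In the agentic loop described by the paper, the model decides when to emit such code, runs it, and reasons over the transformed image before answering.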