OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
May 13, 2025
Authors: Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, Yu Cheng
cs.AI
Abstract
While humans can flexibly leverage interactive visual cognition for complex problem-solving, enabling Large Vision-Language Models (LVLMs) to learn similarly adaptive behaviors with visual tools remains challenging. A significant hurdle is the current lack of standardized infrastructure, which hinders integrating diverse tools, generating rich interaction data, and training robust agents effectively. To address these gaps, we introduce OpenThinkIMG, the first open-source, comprehensive end-to-end framework for tool-augmented LVLMs. It features standardized vision tool interfaces, scalable trajectory generation for policy initialization, and a flexible training environment. Furthermore, because supervised fine-tuning (SFT) on static demonstrations offers limited policy generalization for dynamic tool invocation, we propose V-ToolRL, a novel reinforcement learning (RL) framework that trains LVLMs to learn adaptive policies for invoking external vision tools. V-ToolRL enables LVLMs to autonomously discover optimal tool-usage strategies by directly optimizing for task success using feedback from tool interactions. We empirically validate V-ToolRL on challenging chart reasoning tasks. Our RL-trained agent, built upon Qwen2-VL-2B, significantly outperforms its SFT-initialized counterpart (+28.83 points) and surpasses established supervised tool-learning baselines such as Taco and CogCom by an average of +12.7 points. Notably, it also surpasses prominent closed-source models such as GPT-4.1 by +8.68 accuracy points. We hope OpenThinkIMG can serve as a foundational framework for advancing dynamic, tool-augmented visual reasoning, helping the community develop AI agents that can genuinely "think with images".
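The abstract's core idea is a rollout-and-reward loop: the agent alternates between invoking vision tools and answering, and the policy is optimized directly against task success rather than imitation of static demonstrations. Below is a minimal Python sketch of such a loop. Every name here (VisionTool, ZoomTool, random_policy, run_episode) is a hypothetical stand-in, not the actual OpenThinkIMG or V-ToolRL API, and the random policy is a placeholder for the LVLM whose sampled tool calls would receive this sparse reward signal during RL training.

```python
# Minimal sketch of a tool-augmented rollout with a task-success reward.
# All names are illustrative assumptions, not the OpenThinkIMG API.

import random
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToolCall:
    name: str                               # tool to invoke, or "answer" to stop
    args: Dict[str, str] = field(default_factory=dict)


class VisionTool:
    """Standardized tool interface: consume an observation, return a new one."""
    name = "base"

    def __call__(self, observation: str, **kwargs) -> str:
        raise NotImplementedError


class ZoomTool(VisionTool):
    """Toy tool that 'zooms' into a region of the observation."""
    name = "zoom"

    def __call__(self, observation: str, region: str = "full") -> str:
        return f"{observation} | zoom({region})"


def random_policy(question: str, observation: str) -> ToolCall:
    # Placeholder for the LVLM policy; RL training would replace this with
    # sampled model outputs whose log-probs are updated using the reward.
    if random.random() < 0.5:
        return ToolCall("zoom", {"region": "legend"})
    return ToolCall("answer", {"text": "42"})


def run_episode(policy, tools, question, image, gold_answer, max_steps=4):
    """Roll out one trajectory; reward is 1.0 only if the final answer is correct."""
    observation, trajectory = image, []
    for _ in range(max_steps):
        action = policy(question, observation)
        trajectory.append(action)
        if action.name == "answer":
            reward = 1.0 if action.args.get("text") == gold_answer else 0.0
            return trajectory, reward
        observation = tools[action.name](observation, **action.args)
    return trajectory, 0.0  # ran out of steps without answering


if __name__ == "__main__":
    tools = {"zoom": ZoomTool()}
    traj, reward = run_episode(
        random_policy, tools,
        question="What is the peak value in the chart?",
        image="<chart pixels>", gold_answer="42",
    )
    print([t.name for t in traj], "reward:", reward)
```

The abstract does not name the specific policy-gradient algorithm, so the update step is deliberately left out; the sketch only illustrates how tool interactions feed back into a single scalar task-success reward per trajectory.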