Visual Reasoning through Tool-supervised Reinforcement Learning

Abstract

We investigate how Multimodal Large Language Models (MLLMs) can effectively master tool use to solve complex visual reasoning tasks. To this end, we propose Tool-supervised Reinforcement Learning (ToolsRL), a novel framework that uses direct tool supervision for more effective tool-use learning. We focus on a set of simple, native, and interpretable visual tools, including zoom-in, rotate, flip, and draw point/line, for which tool supervision is easy to collect. We develop a reinforcement learning curriculum in which the first stage is optimized solely by a set of well-motivated tool-specific rewards, and the second stage is trained with accuracy-targeted rewards while still allowing tool calls. In this way, the model masters tool calling before it uses tools to complete visual reasoning tasks, which avoids potential optimization conflicts among these heterogeneous tasks. Our experiments show that the tool-supervised curriculum training is efficient and that ToolsRL achieves strong tool-use capabilities on complex visual reasoning tasks.
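As a rough illustration of the two-stage curriculum, the reward selection could be sketched as follows. The abstract does not define the rewards, so everything here is an assumption made for illustration: the `Rollout` fields, the `curriculum_reward` helper, the [0, 1] tool score, and the binary accuracy reward are hypothetical and are not the paper's actual formulation.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One sampled trajectory. All fields are hypothetical placeholders."""
    tool_score: float      # assumed agreement of tool calls with tool supervision, in [0, 1]
    answer_correct: bool   # assumed flag: final answer matches the reference


def curriculum_reward(rollout: Rollout, stage: int) -> float:
    """Return a scalar reward for the given curriculum stage (illustrative only)."""
    if stage == 1:
        # Stage 1: only the tool-specific reward is used; answer accuracy is ignored.
        return rollout.tool_score
    if stage == 2:
        # Stage 2: accuracy-targeted reward. Tool calls remain allowed,
        # but in this sketch they are not rewarded directly.
        return 1.0 if rollout.answer_correct else 0.0
    raise ValueError(f"unknown stage: {stage}")
```

Keeping the two reward signals in separate stages, rather than summing them, mirrors the stated motivation of avoiding optimization conflict between learning to call tools and learning to answer correctly.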
