Visual Reasoning through Tool-supervised Reinforcement Learning

Abstract

In this paper, we investigate how Multimodal Large Language Models can effectively master tool use to solve complex visual reasoning tasks. To this end, we propose a novel Tool-supervised Reinforcement Learning (ToolsRL) framework, which provides direct tool supervision for more effective tool-use learning. We focus on a set of simple, native, and interpretable visual tools, including zoom-in, rotate, flip, and draw point/line, for which tool supervision is easy to collect. We develop a reinforcement learning curriculum in which the first stage is optimized solely with a set of well-motivated tool-specific rewards, and the second stage is trained with accuracy-targeted rewards while tool calls are allowed. In this way, the model masters tool calling before it uses tools to complete visual reasoning tasks, avoiding potential optimization conflicts among these heterogeneous tasks. Our experiments show that the tool-supervised curriculum training is efficient and that ToolsRL achieves strong tool-use capabilities on complex visual reasoning tasks.
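
The two-stage curriculum described above can be illustrated with a minimal sketch. The abstract does not specify the form of the tool-specific or accuracy rewards, so everything below is an assumption made for illustration: the `Rollout` record, the `curriculum_reward` function, the use of box IoU as a zoom-in reward, and exact-match scoring for accuracy are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Rollout:
    """One sampled model response (all fields are illustrative)."""
    answer: str
    zoom_box: Optional[Box] = None  # region the model asked to zoom into


def curriculum_reward(
    stage: int,
    rollout: Rollout,
    *,
    target_box: Optional[Box] = None,
    target_answer: Optional[str] = None,
) -> float:
    """Return the reward for one rollout under a two-stage schedule.

    Stage 1 scores only tool use (assumed here: IoU of the zoom-in box
    against a supervised target box). Stage 2 scores only the final
    answer (assumed here: case-insensitive exact match), while tool
    calls remain allowed but are no longer rewarded directly.
    """
    if stage == 1:
        if rollout.zoom_box is None or target_box is None:
            return 0.0
        return box_iou(rollout.zoom_box, target_box)
    if stage == 2:
        if target_answer is None:
            return 0.0
        same = rollout.answer.strip().lower() == target_answer.strip().lower()
        return 1.0 if same else 0.0
    raise ValueError(f"unknown curriculum stage: {stage}")
```

Keeping the two reward signals in separate stages, rather than summing them, mirrors the abstract's stated motivation of avoiding optimization conflicts among heterogeneous tasks; how the actual framework combines or schedules its rewards may differ.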
