DeepEyesV2: Toward Agentic Multimodal Model

Abstract

Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This phenomenon motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which inherently requires the integration of multiple capabilities, including perception, search, and reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows the model to selectively invoke tools based on context. We hope our study can provide guidance for the community in developing agentic multimodal models.
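As a rough illustration of what "invoking external tools and integrating these operations into reasoning" can look like, the sketch below runs a minimal tool-calling loop. It is not the DeepEyesV2 implementation: the tool names (`code`, `search`), the stand-in tools `run_code` and `web_search`, the tiny in-memory corpus, and the hard-coded `scripted_policy` that replaces a trained model are all hypothetical and chosen only to keep the example self-contained.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for external tools; not the tools used in the paper.
def run_code(expr: str) -> str:
    """Stand-in for a code execution tool: evaluates simple arithmetic only."""
    allowed = set("0123456789+-*/(). ")
    if not set(expr) <= allowed:
        return "error: unsupported expression"
    return str(eval(expr, {"__builtins__": {}}, {}))

def web_search(query: str) -> str:
    """Stand-in for a web search tool: looks up a tiny in-memory corpus."""
    corpus = {"eiffel tower height": "330 m"}  # illustrative data only
    return corpus.get(query.lower(), "no results")

TOOLS: Dict[str, Callable[[str], str]] = {"code": run_code, "search": web_search}

# An action is (name, argument); name is a tool name or "answer".
Action = Tuple[str, str]
Policy = Callable[[List[str]], Action]

def agent_loop(policy: Policy, max_turns: int = 4) -> Tuple[str, List[str]]:
    """Alternate between policy decisions and tool observations until an answer."""
    observations: List[str] = []
    for _ in range(max_turns):
        action, arg = policy(observations)
        if action == "answer":
            return arg, observations
        tool = TOOLS.get(action)
        result = tool(arg) if tool else f"error: unknown tool {action}"
        observations.append(result)  # tool output is fed back into the next decision
    return "no answer", observations

def scripted_policy(observations: List[str]) -> Action:
    """Hard-coded replacement for a model: search first, then compute, then answer."""
    if len(observations) == 0:
        return ("search", "Eiffel Tower height")
    if len(observations) == 1:
        metres = observations[0].split()[0]  # take the number from "330 m"
        return ("code", f"{metres} * 2")
    return ("answer", observations[-1])

answer, trace = agent_loop(scripted_policy)
print(answer, trace)
```

In this toy setup the policy chooses which tool to call based on the observations gathered so far, which loosely mirrors the context-dependent tool invocation described in the abstract. In a real system the policy would be the multimodal model itself.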
