ParaVT: Overcoming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

Abstract

Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn), which has three costs: a single wrong crop propagates errors without peer correction, multi-turn tool calls pollute the context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent, end-to-end RL-trained framework for Parallel Video Tool calling, which dispatches multiple time-window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize the cold-started structural format and expose a skip-tool reward shortcut under temperature sampling. A cross-model contrast on a weaker-prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) per-prompt frame-budget randomization, which creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by +7.9% on average, and PARA-GRPO lifts training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.
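To make the contrast between sequential and parallel tool calling concrete, the following is a minimal sketch of dispatching several time-window crops in a single turn. It is an illustration under stated assumptions, not the ParaVT implementation: the `crop_window` tool, the `CropResult` record, the 1 fps sampling rate, and the 600-second video length are all hypothetical. The point it shows is the fault-tolerance property described above: a bad window is reported as a failed result while the other crops in the same turn still return.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List, Optional, Tuple

Window = Tuple[float, float]  # (start_s, end_s), in seconds


@dataclass
class CropResult:
    window: Window
    frames: Optional[List[int]]  # sampled frame indices, or None if the crop failed
    error: Optional[str] = None


def crop_window(window: Window, fps: float = 1.0, video_len_s: float = 600.0) -> List[int]:
    """Hypothetical crop tool: return the frame indices sampled inside the window."""
    start, end = window
    if not (0.0 <= start < end <= video_len_s):
        raise ValueError(f"invalid window {window}")
    return list(range(int(start * fps), int(end * fps)))


def dispatch_parallel(windows: List[Window], max_workers: int = 4) -> List[CropResult]:
    """Run all crops of one turn concurrently; one failure does not sink the rest."""

    def safe_crop(w: Window) -> CropResult:
        try:
            return CropResult(w, crop_window(w))
        except ValueError as exc:
            # Record the failure so the remaining windows can still be used.
            return CropResult(w, None, str(exc))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves the input order of the windows.
        return list(pool.map(safe_crop, windows))


if __name__ == "__main__":
    for result in dispatch_parallel([(0, 5), (700, 710), (10, 12)]):
        print(result)
```

In a sequential scheme each of these crops would occupy its own turn, so the invalid middle window would cost a full round trip before the model could recover; here all three results arrive together and the model can ignore the failed one.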