ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

Abstract

Training large multimodal models (LMMs) with reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn), which has three drawbacks: a single wrong crop propagates its error with no peer call to correct it, multi-turn tool calls pollute the context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first framework for Parallel Video Tool calling trained with multi-agent end-to-end RL; it dispatches multiple time-window crops in a single turn, giving a cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize the structured format established at cold start and, under temperature sampling, expose a reward shortcut of skipping the tool altogether. A cross-model contrast with a weaker-prior LMM supports this claim: its format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) per-prompt frame-budget randomization, which creates training prompts on which calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by 7.9% on average, and PARA-GRPO raises training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.
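
To make the contrast with one-call-per-turn dispatch concrete, the following is a minimal sketch of issuing several time-window crops in a single turn. It is an illustration under stated assumptions, not the ParaVT implementation: `TimeWindow`, `crop_frames`, and `dispatch_parallel` are hypothetical names, and the "video" is just a list of frame timestamps. It shows only the fault-tolerance idea, namely that one invalid crop is reported without blocking the other crops issued in the same turn.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeWindow:
    """Hypothetical crop request: a time span in seconds."""
    start_s: float
    end_s: float


def crop_frames(frame_times, window):
    """Hypothetical crop tool: keep frame timestamps inside the window."""
    if window.end_s <= window.start_s:
        raise ValueError(f"empty window: {window}")
    return [t for t in frame_times if window.start_s <= t < window.end_s]


def dispatch_parallel(frame_times, windows):
    """Run all crops of one turn concurrently and isolate per-window failures."""

    def safe_crop(window):
        try:
            return window, crop_frames(frame_times, window), None
        except ValueError as err:
            # A bad window is reported but does not block its peers.
            return window, [], str(err)

    with ThreadPoolExecutor(max_workers=max(1, len(windows))) as pool:
        # pool.map returns results in the same order as the input windows.
        return list(pool.map(safe_crop, windows))


# Toy "video": one frame every 5 seconds over one minute.
frame_times = [float(t) for t in range(0, 60, 5)]

# Three crops issued in a single turn; the second one is deliberately invalid.
windows = [TimeWindow(0, 20), TimeWindow(40, 30), TimeWindow(45, 60)]

results = dispatch_parallel(frame_times, windows)
for window, frames, error in results:
    print(window, frames, error)
```

In a sequential scheme the invalid second window would consume a full turn before the third crop could be requested; here all three results come back together, so the model sees the valid crops and the error report in one context.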