ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

Abstract

Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn): a single wrong crop propagates errors without peer correction, multi-turn tool calls pollute the context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent, end-to-end RL-trained framework for parallel video tool calling, which dispatches multiple time-window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the **Tool Prior Paradox**: the pretrained tool priors that enable tool exploration also destabilize the cold-started structural format and expose a skip-tool reward shortcut under temperature sampling. A cross-model contrast on a weaker-prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose **PARA-GRPO** (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by 7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.
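
The contrast between sequential and parallel dispatch can be illustrated with a minimal sketch. This is not the ParaVT implementation: `crop_frames`, `dispatch_parallel`, the `Window` type, and the 1 fps sampling rate are all hypothetical names and values chosen for illustration. The sketch only shows the general idea that several time-window crops requested in one turn can run side by side, so that one bad window does not block the others.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Window = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def crop_frames(window: Window, fps: float = 1.0) -> List[float]:
    """Hypothetical crop tool: return the timestamps sampled inside the window."""
    start, end = window
    if end <= start:
        raise ValueError(f"empty window: {window}")
    n = int((end - start) * fps)
    return [start + i / fps for i in range(n)]


def dispatch_parallel(windows: List[Window]) -> List[dict]:
    """Run every crop requested in a single turn; failures are recorded, not raised."""

    def safe_crop(w: Window) -> dict:
        try:
            return {"window": w, "frames": crop_frames(w), "error": None}
        except ValueError as exc:
            return {"window": w, "frames": [], "error": str(exc)}

    # max_workers must be positive, so guard against an empty request list.
    with ThreadPoolExecutor(max_workers=max(1, len(windows))) as pool:
        # pool.map preserves the order of the input windows.
        return list(pool.map(safe_crop, windows))


# One turn requesting three crops; the second window is deliberately invalid.
results = dispatch_parallel([(0.0, 3.0), (10.0, 10.0), (20.0, 22.0)])
for r in results:
    print(r["window"], len(r["frames"]), r["error"])
```

In this toy setting the invalid window yields an error entry while the two valid crops still return frames, which mirrors the fault-tolerance argument made for single-turn parallel calls.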