ParaVT: Resolving the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

Abstract

Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn): a single wrong crop propagates errors without peer correction, multi-turn tool calls corrupt context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling, dispatching multiple time-window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose the skip-tool reward shortcut under temperature sampling. A cross-model contrast on a weaker-prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.
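To make the single-turn parallel dispatch and the per-prompt frame-budget randomization more concrete, the following is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the tool interface, so `CropRequest`, `crop_window`, `dispatch_parallel`, `sample_frame_budget`, the 1 fps sampling rate, and the candidate budgets `(8, 16, 32, 64)` are all hypothetical names and values chosen for illustration. The crop tool here returns only frame timestamps rather than decoded frames.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class CropRequest:
    """One time-window crop requested by the model (hypothetical schema)."""
    start_s: float
    end_s: float


def crop_window(video_duration_s, request, fps=1.0):
    """Hypothetical crop tool: timestamps of frames sampled inside the window.

    The window is clamped to the video; an empty or inverted window yields [].
    """
    start = max(0.0, request.start_s)
    end = min(video_duration_s, request.end_s)
    if end <= start:
        return []
    n_frames = int((end - start) * fps)
    return [start + i / fps for i in range(n_frames)]


def dispatch_parallel(video_duration_s, requests, fps=1.0):
    """Run every crop of a single turn concurrently.

    Results are returned in request order, so one bad window produces an
    empty result without blocking or altering the other crops.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(requests))) as pool:
        return list(
            pool.map(lambda r: crop_window(video_duration_s, r, fps), requests)
        )


def sample_frame_budget(rng, choices=(8, 16, 32, 64)):
    """Draw a frame budget for one prompt (candidate budgets are assumed)."""
    return rng.choice(choices)


if __name__ == "__main__":
    rng = random.Random(0)
    budget = sample_frame_budget(rng)
    turn = [CropRequest(0, 4), CropRequest(98, 120), CropRequest(50, 40)]
    crops = dispatch_parallel(100.0, turn)
    print("frame budget for this prompt:", budget)
    for req, frames in zip(turn, crops):
        print(req, "->", frames)
```

In this sketch the third request is an inverted window and simply returns an empty list while the other two crops still succeed, which illustrates the fault-tolerance argument made for parallel dispatch. How the sampled budget constrains the initial frame sampling, and how it interacts with the reward, is left open here because the abstract does not describe it.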