EVA: Efficient Reinforcement Learning for End-to-End Video Agent
March 24, 2026
Authors: Yaolun Zhang, Ruohui Wang, Jiahao Wang, Yepeng Tang, Xuanyu Zheng, Haonan Duan, Hao Lu, Hanming Deng, Lewei Lu
cs.AI
Abstract
Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for an End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective three-stage learning pipeline, comprising supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Generalized Reward Policy Optimization (GRPO), which bridges supervised imitation and reinforcement learning. We further construct high-quality datasets for each stage, supporting stable and reproducible training. We evaluate EVA on six video understanding benchmarks, demonstrating its comprehensive capabilities. Compared with existing baselines, EVA achieves a substantial improvement of 6-12% over general MLLM baselines and a further 1-3% gain over prior adaptive agent methods. Our code and model are available at https://github.com/wangruohui/EfficientVideoAgent.
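The summary-plan-action-reflection loop described above can be sketched as follows. This is a minimal illustrative outline, not the paper's implementation: the `Plan` structure, `planner` interface, and frame-slicing action are all hypothetical stand-ins for the interfaces in the authors' repository, and a real agent would use an MLLM for planning and perception.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    action: str        # "watch" (inspect a clip) or "answer" (terminate)
    start: int = 0     # first frame index to watch
    end: int = 0       # one-past-last frame index
    answer: str = ""   # final answer when action == "answer"

def eva_loop(query, frames, planner, max_rounds=4):
    """Illustrative planning-before-perception loop: the agent decides which
    frames to inspect before touching any pixels, then records what it saw."""
    memory = []  # running notes, standing in for the agent's summary state
    for _ in range(max_rounds):
        plan = planner(query, memory)          # plan first, from the summary so far
        if plan.action == "answer":
            return plan.answer                 # early exit: no more frames needed
        clip = frames[plan.start:plan.end]     # action: targeted, query-driven sampling
        memory.append(f"frames {plan.start}-{plan.end}: {clip}")  # reflection note
    return "no answer"                         # frame budget exhausted

# Toy planner: watch one clip, then answer from memory.
def toy_planner(query, memory):
    if not memory:
        return Plan("watch", 0, 2)
    return Plan("answer", answer=f"saw {len(memory)} clip(s)")

result = eva_loop("what happens first?", ["f0", "f1", "f2", "f3"], toy_planner)
# result == "saw 1 clip(s)"
```

The key design point mirrored here is the ordering: the planner runs before any perception, so an uninformative query can terminate after a few targeted clips instead of decoding every frame.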