ChatPaper.ai


EVA: Efficient Reinforcement Learning for End-to-End Video Agent

March 24, 2026
作者: Yaolun Zhang, Ruohui Wang, Jiahao Wang, Yepeng Tang, Xuanyu Zheng, Haonan Duan, Hao Lu, Hanming Deng, Lewei Lu
cs.AI

Abstract

Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective three-stage learning pipeline - comprising supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Generalized Reward Policy Optimization (GRPO) - that bridges supervised imitation and reinforcement learning. We further construct high-quality datasets for each stage, supporting stable and reproducible training. We evaluate EVA on six video understanding benchmarks, demonstrating its comprehensive capabilities. Compared with existing baselines, EVA achieves a substantial improvement of 6-12% over general MLLM baselines and a further 1-3% gain over prior adaptive agent methods. Our code and model are available at https://github.com/wangruohui/EfficientVideoAgent.
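The abstract describes an iterative summary-plan-action-reflection loop in which the agent plans before perceiving, sampling only the frames its plan requests. The sketch below illustrates that control flow under stated assumptions: the `AgentState`, `plan`, `act`, and `reflect` names and the toy stopping policy are illustrative inventions, not the released EVA API, and a real agent would drive `plan` and `reflect` with the MLLM rather than the hard-coded rules shown here.

```python
# Hypothetical sketch of a planning-before-perception agent loop
# (summary -> plan -> action -> reflection). All names and interfaces
# are illustrative assumptions, not the actual EVA implementation.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    query: str
    summary: str = ""                         # running summary of what was seen
    observations: list = field(default_factory=list)

def plan(state):
    """Decide what to watch next; return frame indices, or None when done."""
    # Toy policy: probe two frames, then stop. A real agent would let the
    # MLLM choose frames conditioned on the query and the running summary.
    if len(state.observations) >= 2:
        return None
    return [len(state.observations) * 100]    # e.g. frame 0, then frame 100

def act(video, frame_ids):
    """Sample only the requested frames (query-driven perception)."""
    return [video[i] for i in frame_ids]

def reflect(state, frames):
    """Fold the new observations into the running summary."""
    state.observations.extend(frames)
    state.summary = f"seen {len(state.observations)} frames"

def run_agent(video, query, max_steps=8):
    state = AgentState(query=query)
    for _ in range(max_steps):                # iterate summary-plan-action-reflection
        frame_ids = plan(state)
        if frame_ids is None:                 # agent decides it has seen enough
            break
        frames = act(video, frame_ids)
        reflect(state, frames)
    return state.summary

if __name__ == "__main__":
    video = {i: f"frame-{i}" for i in range(1000)}
    print(run_agent(video, "What happens at the end?"))
```

The key property the loop illustrates is that perception (`act`) happens only after a plan is formed, so the number of frames processed scales with the query rather than with the video length.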