InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning
June 10, 2026
Authors: Ziang Yan, Sheng Xia, Jiashuo Yu, Yue Wu, Tianxiang Jiang, Songze Li, Kanghui Tian, Yicheng Xu, Yinan He, Kai Chen, Limin Wang, Yu Qiao, Yi Wang
cs.AI
Abstract
Recent progress in foundation models has shifted toward agentic behavior involving multi-step reasoning and tool use. However, open-source efforts largely focus on text-dominant settings, leaving long-horizon multimodal tasks underexplored. This gap is evident in video tasks requiring sustained temporal understanding and iterative interaction. We present InternVideo3, a framework enhancing these capabilities via Multimodal Contextual Reasoning (MCR). MCR treats understanding as a closed-loop process over a shared, evolving context containing observations, instructions, reasoning, tool actions, and memory. This frames long-video understanding as evidence accumulation and verification. To ensure efficiency, we introduce Multimodal Multi-head Latent Attention (M^2LA), a token-preserving reparameterization compressing KV-cache states while retaining the full token stream. Our staged training includes continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. Experiments show InternVideo3 achieves strong performance on benchmarks like Video-MME, MLVU, and EgoSchema. We further instantiate the model as a video agent with retrieval tools, demonstrating robust evidence-grounded behavior. Our results suggest that efficient context handling and closed-loop reasoning are vital for adapting open multimodal models toward long-horizon visually grounded agency.
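The closed-loop process described for MCR can be illustrated with a toy sketch. This is not the authors' implementation: the `Context` class, the keyword-matching `retrieve` tool, the `run_loop` function, and the `needed` threshold are all hypothetical names and simplifications chosen for illustration, and a list of per-segment captions stands in for a real video. The sketch only shows the general shape of a loop that reads and writes a shared context (observations, instruction, reasoning, tool actions, memory) and treats answering as evidence accumulation followed by a verification check.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Shared, evolving context; the fields mirror the list in the abstract."""
    instruction: str
    observations: list = field(default_factory=list)
    reasoning: list = field(default_factory=list)
    tool_actions: list = field(default_factory=list)
    memory: dict = field(default_factory=dict)  # segment index -> evidence text


def retrieve(captions, query, seen):
    """Hypothetical retrieval tool: first unseen segment whose caption mentions the query."""
    for idx, text in enumerate(captions):
        if idx not in seen and query in text:
            return idx, text
    return None


def run_loop(captions, instruction, query, needed=2, max_steps=8):
    """Accumulate evidence until `needed` segments support the query, or none remain."""
    ctx = Context(instruction=instruction)
    for step in range(max_steps):
        # Act: record the tool call in the shared context, then execute it.
        ctx.tool_actions.append(("retrieve", query))
        hit = retrieve(captions, query, ctx.memory)
        if hit is None:
            ctx.reasoning.append(f"step {step}: no further evidence")
            break
        # Observe and remember: the new evidence becomes part of the context.
        idx, text = hit
        ctx.observations.append(text)
        ctx.memory[idx] = text
        ctx.reasoning.append(f"step {step}: segment {idx} supports the query")
        if len(ctx.memory) >= needed:
            break
    # Verify: the claim holds only if enough independent segments support it.
    verified = len(ctx.memory) >= needed
    return verified, ctx


# Toy "video": one caption per segment (made-up data for illustration only).
captions = [
    "a man opens the door",
    "a dog runs across the yard",
    "the man closes the door",
    "closing credits",
]
verified, ctx = run_loop(captions, "Does the door appear at least twice?", "door")
```

In this toy version the policy is a fixed rule and the tool is a substring match; in a system like the one the abstract describes, a model would presumably choose the next action from the context, which the sketch does not attempt to reproduce.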