InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning
June 10, 2026
Authors: Ziang Yan, Sheng Xia, Jiashuo Yu, Yue Wu, Tianxiang Jiang, Songze Li, Kanghui Tian, Yicheng Xu, Yinan He, Kai Chen, Limin Wang, Yu Qiao, Yi Wang
cs.AI
Abstract
Recent progress in foundation models has shifted toward agentic behavior that involves multi-step reasoning and tool use. However, open-source efforts largely focus on text-dominant settings, leaving long-horizon multimodal tasks underexplored. This gap is especially evident in video tasks that require sustained temporal understanding and iterative interaction. We present InternVideo3, a framework that strengthens these capabilities through Multimodal Contextual Reasoning (MCR). MCR treats understanding as a closed-loop process over a shared, evolving context that contains observations, instructions, reasoning, tool actions, and memory. This frames long-video understanding as evidence accumulation and verification. To keep this efficient, we introduce Multimodal Multi-head Latent Attention (M²LA), a token-preserving reparameterization that compresses KV-cache states while retaining the full token stream. Our staged training comprises continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. Experiments show that InternVideo3 achieves strong performance on benchmarks such as Video-MME, MLVU, and EgoSchema. We further instantiate the model as a video agent equipped with retrieval tools, which demonstrates robust evidence-grounded behavior. Our results suggest that efficient context handling and closed-loop reasoning are vital for adapting open multimodal models toward long-horizon, visually grounded agency.
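The closed-loop view of MCR described in the abstract can be illustrated with a minimal sketch. The abstract does not specify an implementation, so everything here is an assumption made for illustration: the `Context` fields simply mirror the five components the abstract lists, and `closed_loop`, `toy_retrieve`, and `toy_verifier` are hypothetical names standing in for the model's reasoning, its retrieval tool, and its verification step.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Context:
    """Shared, evolving context. Hypothetical structure, not taken from the paper."""
    instruction: str
    observations: List[str] = field(default_factory=list)
    reasoning: List[str] = field(default_factory=list)
    tool_actions: List[str] = field(default_factory=list)
    memory: List[str] = field(default_factory=list)


def closed_loop(
    ctx: Context,
    retrieve: Callable[[str, int], str],
    is_sufficient: Callable[[Context], bool],
    max_steps: int = 5,
) -> Context:
    """Accumulate evidence until the verifier accepts it or the step budget runs out."""
    for step in range(max_steps):
        # Verification: stop as soon as the gathered evidence is judged sufficient.
        if is_sufficient(ctx):
            ctx.reasoning.append(f"step {step}: evidence verified, stopping")
            break
        # Otherwise record the decision, call the tool, and fold the result back in.
        ctx.reasoning.append(f"step {step}: evidence insufficient, retrieving")
        ctx.tool_actions.append(f"retrieve(step={step})")
        clip = retrieve(ctx.instruction, step)
        ctx.observations.append(clip)
        ctx.memory.append(f"step {step}: saw {clip}")
    return ctx


def toy_retrieve(query: str, step: int) -> str:
    """Toy stand-in for a video retrieval tool (assumption)."""
    return f"clip_{step}"


def toy_verifier(ctx: Context) -> bool:
    """Toy stand-in for verification: accept once three clips have been observed."""
    return len(ctx.observations) >= 3


result = closed_loop(Context("What happens after the goal?"), toy_retrieve, toy_verifier)
print(result.observations)
```

In this toy run the loop retrieves three clips and then stops on the fourth iteration, once the verifier accepts the accumulated evidence. A real system would replace the stand-ins with model calls; the sketch only shows how observations, reasoning, tool actions, and memory can share one context that grows across iterations.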