InternVideo3: Making Foundation Models Agentic through Multimodal Contextual Reasoning

Abstract

Recent progress in foundation models has shifted toward agentic behavior that involves multi-step reasoning and tool use. Open-source efforts, however, focus largely on text-dominant settings and leave long-horizon multimodal tasks underexplored. The gap is most evident in video tasks, which require sustained temporal understanding and iterative interaction. We present InternVideo3, a framework that strengthens these capabilities through Multimodal Contextual Reasoning (MCR). MCR treats understanding as a closed-loop process over a shared, evolving context that contains observations, instructions, reasoning, tool actions, and memory. Under this view, long-video understanding becomes a matter of accumulating and verifying evidence. For efficiency, we introduce Multimodal Multi-head Latent Attention (M^2LA), a token-preserving reparameterization that compresses KV-cache states while retaining the full token stream. Training proceeds in stages: continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. In experiments, InternVideo3 achieves strong performance on benchmarks such as Video-MME, MLVU, and EgoSchema. We further instantiate the model as a video agent equipped with retrieval tools and demonstrate robust, evidence-grounded behavior. Our results suggest that efficient context handling and closed-loop reasoning are vital for adapting open multimodal models toward long-horizon, visually grounded agency.
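
To make the closed-loop idea concrete, the following is a minimal Python sketch of evidence accumulation over a shared, evolving context. It is an illustration under stated assumptions, not the InternVideo3 implementation: the names `Entry`, `Context`, `run_closed_loop`, `toy_retrieve`, and `toy_answer` are hypothetical, the retrieval tool is a stub that returns text captions in place of video clips, and the verification step is a simple string check standing in for model reasoning.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Entry:
    # kind is one of: "instruction", "observation", "reasoning",
    # "tool_action", "memory" (the context components named in the abstract)
    kind: str
    content: str


@dataclass
class Context:
    """Hypothetical shared context that grows as the loop runs."""
    entries: List[Entry] = field(default_factory=list)

    def add(self, kind: str, content: str) -> None:
        self.entries.append(Entry(kind, content))

    def of_kind(self, kind: str) -> List[str]:
        return [e.content for e in self.entries if e.kind == kind]


def run_closed_loop(
    question: str,
    retrieve: Callable[[str, int], str],
    answer_from: Callable[[List[str]], Optional[str]],
    max_steps: int = 4,
) -> Tuple[Optional[str], Context]:
    """Alternate between checking the evidence and retrieving more of it."""
    ctx = Context()
    ctx.add("instruction", question)
    for step in range(max_steps):
        # Verification: can the evidence gathered so far answer the question?
        answer = answer_from(ctx.of_kind("observation"))
        if answer is not None:
            ctx.add("reasoning", f"evidence sufficient: {answer}")
            return answer, ctx
        # Otherwise record the decision, call the tool, and store its result.
        ctx.add("reasoning", "evidence insufficient; retrieving more")
        ctx.add("tool_action", f"retrieve(step={step})")
        ctx.add("observation", retrieve(question, step))
    return None, ctx


# Toy stand-ins (assumptions): captions play the role of retrieved clips.
clips = [
    "a person enters the kitchen",
    "the person opens the fridge",
    "the person takes out milk",
]


def toy_retrieve(question: str, step: int) -> str:
    return clips[step % len(clips)]


def toy_answer(observations: List[str]) -> Optional[str]:
    for obs in observations:
        if "takes out " in obs:
            return obs.split("takes out ")[1]
    return None


answer, ctx = run_closed_loop(
    "What does the person take from the fridge?", toy_retrieve, toy_answer
)
print(answer)
print(len(ctx.of_kind("observation")))
```

In this toy run the loop stops as soon as the accumulated observations support an answer, so the context holds only the evidence that was actually needed, along with a record of each reasoning step and tool call.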