InternVideo3: Agentifying Foundation Models through Multimodal Contextual Reasoning

Abstract

Recent progress in foundation models has shifted toward agentic behavior that involves multi-step reasoning and tool use. Open-source efforts, however, focus largely on text-dominant settings and leave long-horizon multimodal tasks underexplored. The gap is most evident in video tasks, which require sustained temporal understanding and iterative interaction. We present InternVideo3, a framework that strengthens these capabilities through Multimodal Contextual Reasoning (MCR). MCR treats understanding as a closed-loop process over a shared, evolving context that contains observations, instructions, reasoning, tool actions, and memory, which frames long-video understanding as evidence accumulation and verification. For efficiency, we introduce Multimodal Multi-head Latent Attention (M²LA), a token-preserving reparameterization that compresses KV-cache states while retaining the full token stream. Our staged training comprises continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. Experiments show that InternVideo3 achieves strong performance on benchmarks such as Video-MME, MLVU, and EgoSchema. We further instantiate the model as a video agent with retrieval tools and demonstrate robust evidence-grounded behavior. Our results suggest that efficient context handling and closed-loop reasoning are vital for adapting open multimodal models toward long-horizon, visually grounded agency.
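
The closed-loop view of MCR described in the abstract can be illustrated with a minimal Python sketch. It is not InternVideo3's implementation: the `Context` fields simply mirror the five context components named above, and `run_closed_loop`, `toy_retrieve`, `toy_verify`, and `CLIPS` are hypothetical names introduced here only to show evidence accumulation followed by verification.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Context:
    """Shared, evolving context; field names are illustrative assumptions."""
    instruction: str
    observations: List[str] = field(default_factory=list)
    reasoning: List[str] = field(default_factory=list)
    tool_actions: List[str] = field(default_factory=list)
    memory: List[str] = field(default_factory=list)


def run_closed_loop(
    ctx: Context,
    retrieve: Callable[[str, int], str],
    verify: Callable[[List[str]], bool],
    max_steps: int = 5,
) -> bool:
    """Accumulate evidence step by step until `verify` accepts it."""
    for step in range(max_steps):
        # Tool action: ask the (hypothetical) retrieval tool for evidence.
        query = f"{ctx.instruction} (step {step})"
        ctx.tool_actions.append(f"retrieve: {query}")
        evidence = retrieve(query, step)

        # Observation and memory: the new evidence joins the shared context.
        ctx.observations.append(evidence)
        ctx.memory.append(evidence)

        # Reasoning: check whether the accumulated evidence is sufficient.
        ok = verify(ctx.memory)
        ctx.reasoning.append(f"step {step}: verified={ok}")
        if ok:
            return True
    return False


# Toy stand-ins for a video retrieval tool and a verifier (assumptions).
CLIPS = ["a street at night", "a red car parks", "a person exits the car"]


def toy_retrieve(query: str, step: int) -> str:
    return CLIPS[step % len(CLIPS)]


def toy_verify(memory: List[str]) -> bool:
    return any("red car" in item for item in memory)


ctx = Context(instruction="When does the red car park?")
answered = run_closed_loop(ctx, toy_retrieve, toy_verify)
```

In this toy run the loop stops as soon as the verifier finds supporting evidence in memory, so the context records only the tool calls and observations that were actually needed.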