VideoSeeker: Eliciting Instance-Level Video Understanding via Native Agentic Tool Calling

Abstract

Large Vision-Language Models (LVLMs) have made significant progress in video understanding, yet they still struggle with tasks that require precise spatiotemporal localization at the instance level. Existing methods rely primarily on text prompts for human-model interaction, but text prompts cannot easily convey precise spatial and temporal references, which degrades the user experience. Furthermore, current approaches typically decouple visual perception from language reasoning and center the reasoning on language rather than on visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. To generate large-scale, high-quality instance-level video data efficiently, we construct a fully automated four-stage data synthesis pipeline. We internalize tool-calling and proactive perception capabilities into the model through cold-start supervised learning and reinforcement learning (RL), yielding a strong video understanding model. Experiments show that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing strong closed-source models such as GPT-4o and Gemini-2.5-Pro, while also transferring effectively to general video understanding benchmarks. The datasets and code will be publicly released.