VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation
May 15, 2026
Authors: Yiming Zhao, Yu Zeng, Wenxuan Huang, Zhen Fang, Qing Miao, Qisheng Su, Jiawei Zhao, Jiayin Cai, Lin Chen, Zehui Chen, Yukun Qi, Yao Hu, Xiaolong Jiang, Feng Zhao
cs.AI
Abstract
Large Vision-Language Models (LVLMs) have made significant progress in video understanding, yet they still struggle with tasks that require precise spatiotemporal localization at the instance level. Existing methods rely primarily on text prompts for human-model interaction, but text prompts cannot easily convey precise spatial and temporal references, which results in a poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, so that reasoning centers on language rather than visual content; this limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four-stage, fully automated data synthesis pipeline to efficiently generate large-scale, high-quality instance-level video data. We internalize tool-calling and proactive perception capabilities into the model via cold-start supervision followed by reinforcement learning (RL) training, yielding a strong video understanding model. Experiments demonstrate that our model achieves an average improvement of 13.7% over baselines on instance-level video understanding tasks, surpassing strong closed-source models such as GPT-4o and Gemini-2.5-Pro, while also transferring effectively to general video understanding benchmarks. The datasets and code will be released publicly.