VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

May 15, 2026
Authors: Yiming Zhao, Yu Zeng, Wenxuan Huang, Zhen Fang, Qing Miao, Qisheng Su, Jiawei Zhao, Jiayin Cai, Lin Chen, Zehui Chen, Yukun Qi, Yao Hu, Xiaolong Jiang, Feng Zhao
cs.AI

Abstract

Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in a poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four-stage fully automated data synthesis pipeline to efficiently generate large-scale, high-quality instance-level video data. We internalize tool-calling and proactive perception capabilities into the model via cold-start supervision and reinforcement learning (RL) training, building a powerful video understanding model. Experiments demonstrate that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing powerful closed-source models such as GPT-4o and Gemini-2.5-Pro, while also showing effective transferability on general video understanding benchmarks. The relevant datasets and code will be released publicly.