VideoSeeker: Advancing Instance-Level Video Understanding via Native Agentic Tool Calling

Abstract

Large Vision-Language Models (LVLMs) have made significant progress in video understanding, yet they still struggle with tasks that require precise spatiotemporal localization at the instance level. Existing methods rely primarily on text prompts for human-model interaction, but text prompts cannot easily convey precise spatial and temporal references, which degrades the user experience. Furthermore, current approaches typically decouple visual perception from language reasoning and center the reasoning on language rather than on visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding with visual prompts. VideoSeeker integrates agentic reasoning into instance-level video understanding tasks, enabling the model to proactively perceive and retrieve the video segments it needs on demand. We construct a fully automated four-stage data synthesis pipeline that efficiently generates large-scale, high-quality instance-level video data. Through cold-start supervision followed by RL training, we internalize tool-calling and proactive perception capabilities in the model, yielding a strong video understanding model. Experiments show that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing strong closed-source models such as GPT-4o and Gemini-2.5-Pro, and that it transfers effectively to general video understanding benchmarks. The datasets and code will be released publicly.