A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

December 18, 2025
Authors: Mohammed Irfan Kurpath, Jaseel Muhammad Kaithakkodan, Jinxing Zhou, Sahal Shaji Mullappilly, Mohammad Almansoori, Noor Ahsan, Beknur Kalmakhanbet, Sambal Shikhar, Rishabh Lalla, Jean Lahoud, Mariette Awad, Fahad Shahbaz Khan, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal
cs.AI

Abstract

Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both; and while some incorporate open-ended questions and advanced metrics, most still rely on a single accuracy score, obscuring failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and a graded rubric, enabling interpretable and traceable evaluation. LongShOTBench is produced via a scalable, human-validated pipeline to ensure coverage and reproducibility, and every sample is human-verified and corrected. Furthermore, we present LongShOTAgent, an agentic system that analyzes long videos via preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding. LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs. All resources are available on GitHub: https://github.com/mbzuai-oryx/longshot.
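
The abstract only names the rubric-based protocol; the following is a minimal sketch of what per-item graded scoring could look like. Everything here, including the RubricCriterion structure, the weights, and the judge callable, is an illustrative assumption rather than the released evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RubricCriterion:
    """One gradable aspect of an answer (hypothetical structure)."""
    description: str   # e.g. "grounds the answer in the audio track"
    weight: float      # relative importance of this criterion

def score_item(answer: str,
               reference: str,
               rubric: List[RubricCriterion],
               judge: Callable[[str, str, str], float]) -> float:
    """Weighted rubric score in [0, 1].

    `judge(answer, reference, criterion)` is assumed to return a
    per-criterion score in [0, 1], e.g. from an LLM judge prompted
    with the candidate answer, the reference, and the criterion text.
    """
    total = sum(c.weight for c in rubric)
    if total == 0:
        return 0.0
    earned = sum(c.weight * judge(answer, reference, c.description)
                 for c in rubric)
    return earned / total
```

The point of such a scheme is that the per-criterion breakdown, not just the final scalar, is what makes failure modes traceable, which is exactly the deficiency the abstract attributes to single-score accuracy.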
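Similarly, LongShOTAgent is described only as preprocessing, search, and iterative refinement; below is a hedged sketch of that control flow under those assumptions. The Index and Segment types and the search, draft, and refine callables are placeholders, not the actual system.

```python
from typing import Callable, Dict, List, Tuple

# Placeholder types: an "index" is whatever preprocessing produces
# (e.g. speech transcripts, audio tags, sampled frames); a "segment"
# is a (start_sec, end_sec) span of the video.
Index = Dict[str, object]
Segment = Tuple[float, float]

def answer_query(index: Index,
                 question: str,
                 search: Callable[[Index, str], List[Segment]],
                 draft: Callable[[str, List[Segment]], str],
                 refine: Callable[[str], Tuple[bool, str]],
                 max_rounds: int = 3) -> str:
    """Preprocess -> search -> iterative-refinement control flow.

    The three callables stand in for components the paper only names:
    segment retrieval, answer drafting from multimodal evidence, and a
    check that may issue a refined follow-up query.
    """
    query = question
    answer = ""
    for _ in range(max_rounds):
        segments = search(index, query)      # retrieve candidate clips
        answer = draft(question, segments)   # answer from retrieved evidence
        retry, query = refine(answer)        # stop, or refine the query
        if not retry:
            break
    return answer
```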