
Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task

December 11, 2025
作者: Sunqi Fan, Jiashuo Cui, Meng-Hao Guo, Shuojin Yang
cs.AI

Abstract

The Video Question Answering (VideoQA) task serves as a critical playground for evaluating whether foundation models can effectively perceive, understand, and reason about dynamic real-world scenarios. However, existing Multimodal Large Language Models (MLLMs) struggle to simultaneously model spatial relationships within video frames and understand the causal dynamics of temporal evolution on complex, reasoning-intensive VideoQA tasks. In this work, we equip MLLMs with a comprehensive and extensible Video Toolkit to enhance their spatiotemporal reasoning capabilities, ensuring harmony between the quantity and diversity of tools. To better control the tool invocation sequence and avoid toolchain shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, thereby progressively localizing the key area in the video. Our STAR framework enhances GPT-4o using lightweight tools, achieving an 8.2% gain on VideoMME and a 4.6% gain on LongVideoBench. We believe that our proposed Video Toolkit and STAR framework mark an important step towards building autonomous and intelligent video analysis assistants. The code is publicly available at https://github.com/fansunqi/VideoTool.
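The scheduling idea described above (temporal tools first narrow the video to a key segment, then spatial tools localize a key region within it, so each step shrinks the search space rather than taking a toolchain shortcut) can be sketched as follows. This is a minimal hypothetical illustration, not the paper's actual API: the tool names `temporal_ground` and `spatial_ground`, their signatures, and the stub return values are all assumptions for demonstration.

```python
from typing import Dict, Tuple

# Stub "temporal" tool: pick a candidate time window (seconds) relevant
# to the question. A real tool would inspect the video; this is a placeholder.
def temporal_ground(video: str, question: str) -> Tuple[float, float]:
    return (12.0, 18.0)

# Stub "spatial" tool: pick a bounding box (x, y, w, h) within frames
# sampled from the chosen window. Also a placeholder.
def spatial_ground(video: str, window: Tuple[float, float],
                   question: str) -> Tuple[int, int, int, int]:
    return (40, 60, 128, 128)

def star_schedule(video: str, question: str) -> Dict[str, object]:
    """Invoke the temporal tool before the spatial tool, so localization
    proceeds coarse-to-fine: first *when*, then *where*."""
    window = temporal_ground(video, question)        # step 1: key segment
    box = spatial_ground(video, window, question)    # step 2: key region
    # Step 3 (omitted): the MLLM would answer from the localized crop.
    return {"window": window, "box": box}

result = star_schedule("demo.mp4", "What does the person pick up?")
print(result)
```

The fixed temporal-then-spatial ordering is the point of the sketch: by constraining the invocation sequence, the framework prevents the model from skipping straight to an answer without grounding it in a localized part of the video.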