

Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task

December 11, 2025
Authors: Sunqi Fan, Jiashuo Cui, Meng-Hao Guo, Shuojin Yang
cs.AI

Abstract

Video Question Answering (VideoQA) serves as a critical playground for evaluating whether foundation models can effectively perceive, understand, and reason about dynamic real-world scenarios. However, existing Multimodal Large Language Models (MLLMs) struggle to simultaneously model spatial relationships within video frames and understand the causal dynamics of temporal evolution on complex, reasoning-intensive VideoQA tasks. In this work, we equip MLLMs with a comprehensive and extensible Video Toolkit to enhance their spatiotemporal reasoning capabilities while ensuring harmony between the quantity and diversity of tools. To better control the tool invocation sequence and avoid toolchain shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, thereby progressively localizing the key area in the video. Our STAR framework enhances GPT-4o using lightweight tools, achieving an 8.2% gain on VideoMME and a 4.6% gain on LongVideoBench. We believe our proposed Video Toolkit and STAR framework take an important step towards building autonomous and intelligent video analysis assistants. The code is publicly available at https://github.com/fansunqi/VideoTool.
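
The abstract's central idea, scheduling a temporal tool before spatial tools so that the model progressively narrows the relevant portion of the video, can be illustrated with a minimal sketch. The tool names (`temporal_grounding`, `spatial_grounding`) and the `query_mllm` helper below are hypothetical placeholders for the purpose of illustration, not the released VideoTool API.

```python
# Illustrative sketch of a coarse-to-fine spatiotemporal tool schedule.
# All function names here are hypothetical placeholders, not the actual VideoTool API.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Evidence:
    frames: List[int]                          # indices of candidate key frames
    regions: List[Tuple[int, int, int, int]]   # (x, y, w, h) box per key frame


def temporal_grounding(video_path: str, question: str) -> List[int]:
    """Hypothetical temporal tool: return frame indices likely relevant to the question."""
    raise NotImplementedError


def spatial_grounding(video_path: str, frame_idx: int, question: str) -> Tuple[int, int, int, int]:
    """Hypothetical spatial tool: return a bounding box of the key region in one frame."""
    raise NotImplementedError


def query_mllm(question: str, evidence: Evidence) -> str:
    """Hypothetical call to the backbone MLLM (e.g. GPT-4o) with the gathered evidence."""
    raise NotImplementedError


def answer(video_path: str, question: str) -> str:
    # Step 1: run the temporal tool first, so spatial tools only touch a few key frames.
    key_frames = temporal_grounding(video_path, question)

    # Step 2: run the spatial tool on each key frame to localize the region of interest.
    boxes = [spatial_grounding(video_path, f, question) for f in key_frames]

    # Step 3: hand the progressively localized evidence back to the MLLM for the final answer.
    return query_mllm(question, Evidence(frames=key_frames, regions=boxes))
```

Enforcing this temporal-then-spatial order is one plausible way to avoid the toolchain shortcut issue the abstract mentions, since the model cannot skip straight to a spatial crop without first committing to a temporal segment.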