Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task

Abstract

The Video Question Answering (VideoQA) task serves as a critical playground for evaluating whether foundation models can effectively perceive, understand, and reason about dynamic real-world scenarios. However, on complex, reasoning-intensive VideoQA tasks, existing Multimodal Large Language Models (MLLMs) struggle to model spatial relationships within video frames while also understanding the causal dynamics of temporal evolution. In this work, we equip MLLMs with a comprehensive and extensible Video Toolkit to enhance their spatiotemporal reasoning capabilities while ensuring harmony between the quantity and diversity of tools. To better control the tool invocation sequence and avoid toolchain shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, thereby progressively localizing the key area in the video. Our STAR framework enhances GPT-4o using lightweight tools, achieving an 8.2% gain on VideoMME and a 4.6% gain on LongVideoBench. We believe that our proposed Video Toolkit and STAR framework are an important step towards building autonomous and intelligent video analysis assistants. The code is publicly available at https://github.com/fansunqi/VideoTool.
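
As a rough illustration of what scheduling temporal and spatial tools to progressively localize a key area could look like, the sketch below alternates two toy tools over a shrinking region. It is only one possible reading of the idea, not the STAR implementation: `Region`, `toy_temporal_tool`, `toy_spatial_tool`, and `schedule` are hypothetical names, and the strict alternation rule and the halving behavior are assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Region:
    """Current focus: a frame range plus a bounding box (hypothetical type)."""
    start: int                       # first frame index, inclusive
    end: int                         # last frame index, exclusive
    box: Tuple[int, int, int, int]   # x0, y0, x1, y1 in pixels


def toy_temporal_tool(r: Region) -> Region:
    # Stand-in for a temporal tool: keep the first half of the frame range.
    mid = (r.start + r.end + 1) // 2
    return replace(r, end=max(mid, r.start + 1))


def toy_spatial_tool(r: Region) -> Region:
    # Stand-in for a spatial tool: keep the left half of the bounding box.
    x0, y0, x1, y1 = r.box
    return replace(r, box=(x0, y0, max((x0 + x1 + 1) // 2, x0 + 1), y1))


def schedule(
    region: Region,
    tools: Dict[str, Callable[[Region], Region]],
    max_steps: int = 6,
) -> Tuple[Region, List[str]]:
    """Alternate temporal and spatial tools (assumed rule), narrowing the region."""
    trace: List[str] = []
    kind = "temporal"
    for _ in range(max_steps):
        region = tools[kind](region)
        trace.append(kind)
        # Never call the same kind of tool twice in a row.
        kind = "spatial" if kind == "temporal" else "temporal"
    return region, trace


tools = {"temporal": toy_temporal_tool, "spatial": toy_spatial_tool}
final_region, trace = schedule(Region(0, 64, (0, 0, 128, 128)), tools, max_steps=4)
print(trace)
print(final_region)
```

In a real system the toy tools would be replaced by actual temporal tools (for example, frame or clip selection) and spatial tools (for example, object localization), and the next tool would be chosen by the model rather than by a fixed alternation.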
