LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

November 25, 2025
Authors: Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing
cs.AI

Abstract

Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos, first skimming globally and then examining relevant clips for details, we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for long-video reasoning, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Our training data consist of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT.
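
The global-to-local loop described in the abstract, where the model alternates between free-form reasoning and a native video-cropping tool call until its answer is grounded in sampled frames, can be sketched roughly as below. This is a minimal illustration under assumptions, not the LongVT implementation: the helpers `get_duration`, `sample_frames`, and `lmm_generate`, the `<crop>`/`<answer>` tag format, the round cap, and the frame budgets are all invented for the example.

```python
# Hypothetical sketch of a global-to-local "Thinking with Long Videos" loop.
# All helper functions, tag formats, and budgets below are assumptions for
# illustration; they are not the LongVT API.
import re

MAX_ROUNDS = 5          # assumed cap on zoom-in iterations
GLOBAL_FRAMES = 64      # assumed coarse frame budget over the full video
LOCAL_FRAMES = 32       # assumed finer frame budget for a cropped clip


def answer_long_video(video_path: str, question: str) -> str:
    """Iteratively zoom into clips until the model emits a grounded answer."""
    start, end = 0.0, get_duration(video_path)   # assumed helper: video length in seconds
    context = []                                  # interleaved multimodal reasoning trace

    for _ in range(MAX_ROUNDS):
        # Sample sparsely over the full video first, then more densely
        # over the currently cropped span.
        n = GLOBAL_FRAMES if not context else LOCAL_FRAMES
        frames = sample_frames(video_path, start, end, n)   # assumed helper
        context.append({"frames": frames, "span": (start, end)})

        reply = lmm_generate(question, context)             # assumed LMM call

        # If the model committed to an answer, return it; otherwise parse its
        # native crop call, e.g. "<crop>123.0,180.5</crop>", and zoom in.
        if "<answer>" in reply:
            return reply.split("<answer>")[1].split("</answer>")[0]
        match = re.search(r"<crop>([\d.]+),([\d.]+)</crop>", reply)
        if match is None:
            break
        start, end = float(match.group(1)), float(match.group(2))

    # Assumed fallback: force a final answer from the accumulated evidence.
    return lmm_generate(question, context, force_answer=True)
```

In the paper's framing, the "native" aspect is that the crop span comes from the LMM's own temporal grounding output rather than from an external retrieval or detection module.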