LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
November 25, 2025
Authors: Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing
cs.AI
Abstract
Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos, first skimming globally and then examining relevant clips for details, we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via an interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence.

Given the scarcity of fine-grained question-answering (QA) data for long-video reasoning, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation.

With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT.
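To make the interleaved tool-calling described above concrete, the following is a minimal Python sketch of how such a global-to-local loop could be wired up: the model first sees coarsely sampled frames of the full video, and whenever it emits a crop request, the selected clip is resampled at a finer granularity and fed back in. Every name here (sample_frames, lmm_generate, the <crop> tag format, the turn and frame budgets) is an illustrative assumption, not the actual LongVT implementation.

```python
# Illustrative sketch of the global-to-local "Thinking with Long Videos" loop.
# All names below (sample_frames, lmm_generate, the <crop>...</crop> tool-call
# format, MAX_TURNS, frame budgets) are assumptions for exposition, not the
# actual LongVT interface.
import re

MAX_TURNS = 8        # assumed cap on interleaved tool-calling rounds
GLOBAL_FRAMES = 64   # coarse frames sampled across the whole video (global skim)
LOCAL_FRAMES = 32    # finer-grained frames resampled from a cropped clip


def sample_frames(video_path: str, start: float, end: float, num_frames: int):
    """Placeholder sampler: uniformly spaced timestamps in [start, end] stand in for decoded frames."""
    step = (end - start) / max(num_frames, 1)
    return [start + i * step for i in range(num_frames)]


def lmm_generate(frames, question: str, history: list) -> str:
    """Stand-in for the LMM inference call. A real model would return either a tool
    call such as '<crop>12.0, 45.0</crop>' or a final, evidence-grounded answer."""
    return "Final answer: (placeholder)"


def think_with_long_video(video_path: str, duration: float, question: str) -> str:
    """Skim globally, then repeatedly crop and resample until the answer is grounded."""
    history: list = []
    frames = sample_frames(video_path, 0.0, duration, GLOBAL_FRAMES)  # global skim
    for _ in range(MAX_TURNS):
        response = lmm_generate(frames, question, history)
        history.append(response)
        # Assumed tool-call format emitted by the model: <crop>start_sec, end_sec</crop>
        match = re.search(r"<crop>\s*([\d.]+)\s*,\s*([\d.]+)\s*</crop>", response)
        if match is None:
            return response  # no further cropping requested: treat as the final answer
        start, end = float(match.group(1)), float(match.group(2))
        # Zoom in: resample finer-grained frames from the selected clip.
        frames = sample_frames(video_path, start, end, LOCAL_FRAMES)
    return history[-1]  # fall back to the last response if the turn budget is exhausted


if __name__ == "__main__":
    print(think_with_long_video("demo.mp4", duration=3600.0, question="When is the trophy handed over?"))
```

The key design point the sketch tries to capture is that cropping is a native capability of the same LMM rather than an external retrieval module: the model's own temporal grounding output drives which clip gets resampled next, and the loop ends only when it stops asking for more evidence.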