VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
March 17, 2025
Authors: Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, Mike Zheng Shou
cs.AI
Abstract
Videos, with their unique temporal dimension, demand precise grounded
understanding, where answers are directly linked to visual, interpretable
evidence. Despite significant breakthroughs in reasoning capabilities within
Large Language Models, multi-modal reasoning - especially for videos - remains
underexplored. In this work, we introduce VideoMind, a novel video-language agent
designed for temporal-grounded video understanding. VideoMind incorporates two
key innovations: (i) We identify essential capabilities for video temporal
reasoning and develop a role-based agentic workflow, including a planner for
coordinating different roles, a grounder for temporal localization, a verifier
to assess temporal interval accuracy, and an answerer for question-answering.
(ii) To efficiently integrate these diverse roles, we propose a novel
Chain-of-LoRA strategy, enabling seamless role-switching via lightweight LoRA
adaptors while avoiding the overhead of multiple models, thus balancing
efficiency and flexibility. Extensive experiments on 14 public benchmarks
demonstrate that our agent achieves state-of-the-art performance on diverse
video understanding tasks, including 3 on grounded video question-answering, 6
on video temporal grounding, and 5 on general video question-answering,
underscoring its effectiveness in advancing video agents and long-form temporal
reasoning.
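
To make the two innovations above concrete, below is a minimal, hypothetical sketch of how a Chain-of-LoRA agent could switch between the planner, grounder, verifier, and answerer roles on a single shared backbone. It assumes the Hugging Face transformers and peft libraries; the backbone name, adapter checkpoint paths, and prompts are placeholders rather than VideoMind's released artifacts, and video inputs are omitted for brevity (the actual agent operates on a multimodal video-LLM).

```python
# Illustrative Chain-of-LoRA role switching: one frozen backbone, four
# lightweight LoRA adaptors, with exactly one adaptor active per role.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen2-7B-Instruct"  # placeholder text backbone; VideoMind itself uses a video-LLM

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Attach one LoRA adaptor per role on top of the same base weights.
# The checkpoint paths below are hypothetical.
model = PeftModel.from_pretrained(model, "adapters/planner", adapter_name="planner")
model.load_adapter("adapters/grounder", adapter_name="grounder")
model.load_adapter("adapters/verifier", adapter_name="verifier")
model.load_adapter("adapters/answerer", adapter_name="answerer")

def run_role(role: str, prompt: str, max_new_tokens: int = 128) -> str:
    """Activate the requested role's adaptor, then generate a reply."""
    model.set_adapter(role)  # role switch: only the small LoRA weights change
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Role-based workflow for one grounded video QA query (video tokens omitted):
question = "What does the person do right after opening the fridge?"
plan = run_role("planner", f"Decide which roles are needed and in what order for: {question}")
span = run_role("grounder", f"Return the start/end timestamps of the moment relevant to: {question}")
verdict = run_role("verifier", f"Judge whether the interval {span} actually covers: {question}")
answer = run_role("answerer", f"Answer the question using only the clip {span}: {question}")
print(answer)
```

Because all four roles share the same base weights and differ only in their LoRA adaptors, switching roles costs far less memory and loading time than hosting four separate models, which is the efficiency-versus-flexibility balance the abstract describes.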