VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
March 17, 2025
Authors: Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, Mike Zheng Shou
cs.AI
Abstract
Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in the reasoning capabilities of Large Language Models, multi-modal reasoning, especially for videos, remains underexplored. In this work, we introduce VideoMind, a novel video-language agent designed for temporal-grounded video understanding. VideoMind incorporates two key innovations: (i) We identify the essential capabilities for video temporal reasoning and develop a role-based agentic workflow, including a planner for coordinating the other roles, a grounder for temporal localization, a verifier for assessing the accuracy of temporal intervals, and an answerer for question-answering. (ii) To integrate these diverse roles efficiently, we propose a novel Chain-of-LoRA strategy that enables seamless role switching via lightweight LoRA adapters while avoiding the overhead of multiple models, thus balancing efficiency and flexibility. Extensive experiments on 14 public benchmarks demonstrate that our agent achieves state-of-the-art performance on diverse video understanding tasks, including 3 grounded video question-answering, 6 video temporal grounding, and 5 general video question-answering benchmarks, underscoring its effectiveness in advancing video agents and long-form temporal reasoning.
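The abstract itself contains no code, but its two innovations (a role-based agentic workflow and Chain-of-LoRA role switching) map naturally onto a short sketch. The snippet below is a minimal illustration under stated assumptions: the checkpoint names, adapter paths, prompts, and the `ask` helper are hypothetical, video-frame input is omitted for brevity, and it uses the Hugging Face `transformers`/`peft` APIs rather than VideoMind's released implementation.

```python
# Minimal sketch of the Chain-of-LoRA idea: one shared backbone, with each
# agent role (planner, grounder, verifier, answerer) realized as a lightweight
# LoRA adapter that is activated on demand. Checkpoint paths, adapter names,
# prompts, and the `ask` helper are illustrative assumptions, not VideoMind's
# actual code or released weights; video-frame input is omitted for brevity.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "base-video-llm"  # hypothetical backbone checkpoint
ROLES = ["planner", "grounder", "verifier", "answerer"]

tokenizer = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE)

# Attach one LoRA adapter per role; only the small adapter weights differ
# between roles, so no extra full-size models are kept in memory.
model = PeftModel.from_pretrained(base, f"{BASE}-lora-planner", adapter_name="planner")
for role in ROLES[1:]:
    model.load_adapter(f"{BASE}-lora-{role}", adapter_name=role)

def ask(role: str, prompt: str, max_new_tokens: int = 128) -> str:
    """Activate the requested role's adapter and generate a response."""
    model.set_adapter(role)  # role switch is a single adapter swap, no reload
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Role-based workflow: in the paper the planner decides which roles to invoke;
# here a fixed grounder -> verifier -> answerer chain stands in for that plan.
question = "Why does the chef pause before plating the dish?"
plan = ask("planner", f"Question: {question}\nDecide which roles to call.")
span = ask("grounder", f"Localize the video interval relevant to: {question}")
check = ask("verifier", f"Is the interval {span!r} correct for the question? Answer yes or no.")
answer = ask("answerer", f"Using the clip at {span}, answer: {question}")
```

The design trade-off the abstract describes is visible in `ask`: switching roles is a single `set_adapter` call on one shared backbone, rather than loading and serving four separate models.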