

Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning

June 4, 2025
作者: Daeun Lee, Jaehong Yoon, Jaemin Cho, Mohit Bansal
cs.AI

Abstract

Recent advances in Chain-of-Thought (CoT) reasoning have improved complex video understanding, but existing methods often struggle to adapt to domain-specific skills (e.g., event detection, spatial relation understanding, emotion understanding) across varied video content. To address this, we propose Video-Skill-CoT (a.k.a. Video-SKoT), a framework that automatically constructs and leverages skill-aware CoT supervision for domain-adaptive video reasoning. First, we construct skill-based CoT annotations: we extract domain-relevant reasoning skills from training questions, cluster them into a shared skill taxonomy, and create detailed multi-step CoT rationales tailored to each video-question pair for training. Second, we introduce a skill-specific expert learning framework. Each expert module specializes in a subset of reasoning skills and is trained with lightweight adapters using the collected CoT supervision. We demonstrate the effectiveness of the proposed approach on three video understanding benchmarks, where Video-SKoT consistently outperforms strong baselines. We also provide in-depth analyses comparing different CoT annotation pipelines and the skills learned across multiple video domains.
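The two-stage pipeline in the abstract (cluster extracted skills into a shared taxonomy, then route each question to a skill-specific expert) can be sketched in a few lines. This is a minimal illustrative sketch only: the word-overlap (Jaccard) similarity, the greedy clustering, the `threshold` value, and all function names are assumptions for exposition; the paper's actual pipeline builds the taxonomy from LLM-extracted skills and trains each expert with lightweight adapters.

```python
# Hypothetical sketch of skill-taxonomy construction and expert routing.
# Jaccard similarity over word sets stands in for whatever embedding-based
# similarity a real implementation would use.

def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two skill descriptions."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def cluster_skills(skills, threshold=0.5):
    """Greedily group extracted skill phrases into a shared taxonomy."""
    clusters = []  # each cluster is a list of related skill phrases
    for s in skills:
        best, best_sim = None, 0.0
        for c in clusters:
            sim = max(jaccard(s, member) for member in c)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best.append(s)       # merge into the most similar cluster
        else:
            clusters.append([s])  # start a new skill cluster
    return clusters

def route_to_expert(question_skill, clusters):
    """Pick the expert (cluster index) whose skills best match the question."""
    scores = [max(jaccard(question_skill, m) for m in c) for c in clusters]
    return scores.index(max(scores))

# Toy skill phrases mirroring the examples in the abstract.
skills = [
    "temporal event detection",
    "event boundary detection",
    "spatial relation understanding",
    "emotion understanding",
]
taxonomy = cluster_skills(skills)
expert = route_to_expert("spatial relation reasoning", taxonomy)
```

At inference time, each cluster index would select the corresponding adapter-trained expert module; here the routing simply returns the index of the best-matching skill cluster.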
PDF · June 5, 2025