

Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning

June 4, 2025
作者: Daeun Lee, Jaehong Yoon, Jaemin Cho, Mohit Bansal
cs.AI

Abstract

Recent advances in Chain-of-Thought (CoT) reasoning have improved complex video understanding, but existing methods often struggle to adapt to domain-specific skills (e.g., event detection, spatial relation understanding, emotion understanding) across varied video content. To address this, we propose Video-Skill-CoT (a.k.a. Video-SKoT), a framework that automatically constructs and leverages skill-aware CoT supervision for domain-adaptive video reasoning. First, we construct skill-based CoT annotations: we extract domain-relevant reasoning skills from training questions, cluster them into a shared skill taxonomy, and create detailed multi-step CoT rationales tailored to each video-question pair for training. Second, we introduce a skill-specific expert learning framework. Each expert module specializes in a subset of reasoning skills and is trained with lightweight adapters using the collected CoT supervision. We demonstrate the effectiveness of the proposed approach on three video understanding benchmarks, where Video-SKoT consistently outperforms strong baselines. We also provide in-depth analyses comparing different CoT annotation pipelines and the skills learned across multiple video domains.
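The two-stage pipeline in the abstract (cluster extracted skills into a shared taxonomy, then route each question to a skill-specific expert) can be sketched in a few lines. This is a minimal illustrative sketch only: the word-overlap (Jaccard) similarity, the greedy clustering, the `threshold` value, and all function names are assumptions for exposition; the paper's actual pipeline builds the taxonomy from LLM-extracted skills and trains each expert with lightweight adapters.

```python
# Hypothetical sketch of skill-taxonomy construction and expert routing.
# Jaccard similarity over word sets stands in for whatever embedding-based
# similarity a real implementation would use.

def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two skill descriptions."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def cluster_skills(skills, threshold=0.5):
    """Greedily group extracted skill phrases into a shared taxonomy."""
    clusters = []  # each cluster is a list of related skill phrases
    for s in skills:
        best, best_sim = None, 0.0
        for c in clusters:
            sim = max(jaccard(s, member) for member in c)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best.append(s)       # merge into the most similar cluster
        else:
            clusters.append([s])  # start a new skill cluster
    return clusters

def route_to_expert(question_skill, clusters):
    """Pick the expert (cluster index) whose skills best match the question."""
    scores = [max(jaccard(question_skill, m) for m in c) for c in clusters]
    return scores.index(max(scores))

# Toy skill phrases mirroring the examples in the abstract.
skills = [
    "temporal event detection",
    "event boundary detection",
    "spatial relation understanding",
    "emotion understanding",
]
taxonomy = cluster_skills(skills)
expert = route_to_expert("spatial relation reasoning", taxonomy)
```

At inference time, each cluster index would select the corresponding adapter-trained expert module; here the routing simply returns the index of the best-matching skill cluster.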
PDF · June 5, 2025