Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning

Abstract

Recent advances in Chain-of-Thought (CoT) reasoning have improved complex video understanding, but existing methods often struggle to adapt to domain-specific skills (e.g., event detection, spatial relation understanding, emotion understanding) across diverse video content. To address this, we propose Video-Skill-CoT (a.k.a. Video-SKoT), a framework that automatically constructs and leverages skill-aware CoT supervision for domain-adaptive video reasoning. First, we construct skill-based CoT annotations: we extract domain-relevant reasoning skills from training questions, cluster them into a shared skill taxonomy, and create a detailed multi-step CoT rationale tailored to each video-question pair for training. Second, we introduce a skill-specific expert learning framework. Each expert module specializes in a subset of reasoning skills and is trained with lightweight adapters using the collected CoT supervision. We demonstrate the effectiveness of the proposed approach on three video understanding benchmarks, where Video-SKoT consistently outperforms strong baselines. We also provide in-depth analyses comparing different CoT annotation pipelines and learned skills across multiple video domains.
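
The idea of grouping training questions by reasoning skill so that each group can feed its own expert can be illustrated with a toy sketch. The abstract does not say how skills are extracted or clustered, so the keyword-overlap rule below is only a stand-in chosen for brevity; the taxonomy, keyword sets, and function names (`SKILL_TAXONOMY`, `assign_skill`, `group_by_expert`) are all hypothetical and not taken from the paper.

```python
from collections import defaultdict

# Hypothetical skill taxonomy (illustrative only, not from the paper):
# each skill cluster is described by a small set of trigger keywords.
SKILL_TAXONOMY = {
    "event_detection": {"when", "happens", "after", "before", "event"},
    "spatial_relation": {"where", "left", "right", "behind", "next"},
    "emotion": {"feel", "emotion", "mood", "happy", "sad"},
}


def assign_skill(question: str) -> str:
    """Return the skill whose keywords overlap most with the question.

    Ties (including the all-zero case) fall back to the first skill
    in SKILL_TAXONOMY, because max() keeps the first maximal key.
    """
    tokens = set(question.lower().replace("?", "").split())
    scores = {skill: len(tokens & kws) for skill, kws in SKILL_TAXONOMY.items()}
    return max(scores, key=scores.get)


def group_by_expert(questions):
    """Bucket questions by assigned skill, one bucket per assumed expert."""
    groups = defaultdict(list)
    for q in questions:
        groups[assign_skill(q)].append(q)
    return dict(groups)


if __name__ == "__main__":
    demo = [
        "What happens after the man opens the door?",
        "Is the cup left of the plate?",
        "How does the woman feel at the end?",
    ]
    for skill, qs in group_by_expert(demo).items():
        print(skill, qs)
```

In a real system each bucket would supply the CoT training data for one lightweight adapter; the sketch covers only the grouping step.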
