Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning

Abstract

Recent advances in Chain-of-Thought (CoT) reasoning have improved complex video understanding, but existing methods often struggle to adapt to domain-specific skills (e.g., event detection, spatial relation understanding, emotion understanding) across varied video content. To address this, we propose Video-Skill-CoT (a.k.a. Video-SKoT), a framework that automatically constructs and leverages skill-aware CoT supervision for domain-adaptive video reasoning. First, we construct skill-based CoT annotations: we extract domain-relevant reasoning skills from training questions, cluster them into a shared skill taxonomy, and create a detailed multi-step CoT rationale tailored to each video-question pair for training. Second, we introduce a skill-specific expert learning framework: each expert module specializes in a subset of reasoning skills and is trained with lightweight adapters using the collected CoT supervision. We demonstrate the effectiveness of the proposed approach on three video understanding benchmarks, where Video-SKoT consistently outperforms strong baselines. We also provide in-depth analyses comparing different CoT annotation pipelines and the skills learned across multiple video domains.

