ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation

Abstract

Existing approaches to skill proficiency estimation often rely on black-box video classifiers that ignore multi-view context and lack explainability. We present ProfVLM, a compact vision-language model that reformulates this task as generative reasoning: it jointly predicts skill level and generates expert-like feedback from egocentric and exocentric videos. Central to our method is an AttentiveGatedProjector that dynamically fuses multi-view features extracted by a frozen TimeSformer backbone and projects them into a language model tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60%. Our approach not only achieves superior accuracy across diverse activities, but also outputs natural language critiques aligned with performance, offering transparent reasoning. These results highlight generative vision-language modeling as a powerful new direction for skill assessment.
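The abstract does not specify the internals of the AttentiveGatedProjector, so the following is only a minimal sketch of what an attention-plus-gate fusion of two views could look like, not the authors' implementation. The function name `fuse_views`, the single learned query, the scalar per-view gate, and the feature and embedding widths (8 and 16) are all assumptions made for illustration; random vectors stand in for the per-view features that a frozen video backbone would produce.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over a short list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse_views(view_feats, query, gate_w, proj_w):
    """Hypothetical attention-plus-gate fusion of per-view feature vectors."""
    dim = len(query)
    # Attention: score each view against a query vector, normalise over views.
    weights = softmax([dot(query, f) / math.sqrt(dim) for f in view_feats])
    # Gate: a per-view scalar in (0, 1) that can suppress an uninformative view.
    gates = [sigmoid(dot(gate_w, f)) for f in view_feats]
    # Weighted, gated sum over views yields one fused feature vector.
    fused = [
        sum(w * g * f[i] for w, g, f in zip(weights, gates, view_feats))
        for i in range(dim)
    ]
    # Linear projection into the (assumed) language-model embedding width.
    token = [dot(row, fused) for row in proj_w]
    return weights, gates, token

# Placeholder inputs: random vectors stand in for backbone features and
# for parameters that would normally be learned.
rng = random.Random(0)
feat_dim, lm_dim = 8, 16
ego = [rng.gauss(0, 1) for _ in range(feat_dim)]   # egocentric view feature
exo = [rng.gauss(0, 1) for _ in range(feat_dim)]   # exocentric view feature
query = [rng.gauss(0, 1) for _ in range(feat_dim)]
gate_w = [rng.gauss(0, 1) for _ in range(feat_dim)]
proj_w = [[rng.gauss(0, 0.1) for _ in range(feat_dim)] for _ in range(lm_dim)]

weights, gates, token = fuse_views([ego, exo], query, gate_w, proj_w)
print(len(token), round(sum(weights), 6))
```

In this sketch the attention weights decide how much each view contributes relative to the others, while the gates can independently down-weight a view; the projected vector plays the role of a visual token passed to the language model. A trained model would learn the query, gate, and projection parameters rather than sample them.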
