ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
September 30, 2025
Authors: Edoardo Bianchi, Jacopo Staiano, Antonio Liotta
cs.AI
Abstract
Existing approaches to skill proficiency estimation often rely on black-box
video classifiers, ignoring multi-view context and lacking explainability. We
present ProfVLM, a compact vision-language model that reformulates this task as
generative reasoning: it jointly predicts skill level and generates expert-like
feedback from egocentric and exocentric videos. Central to our method is an
AttentiveGatedProjector that dynamically fuses multi-view features, projected
from a frozen TimeSformer backbone into a language model tuned for feedback
generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses
state-of-the-art methods while using up to 20x fewer parameters and reducing
training time by up to 60%. Our approach not only achieves superior accuracy
across diverse activities, but also outputs natural language critiques aligned
with performance, offering transparent reasoning. These results highlight
generative vision-language modeling as a powerful new direction for skill
assessment.
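To make the fusion step concrete, below is a minimal sketch of an attentive gated projector of the kind the abstract describes: per-view features from a frozen video backbone are weighted by learned gates and projected into the language model's embedding space. The class name follows the abstract, but the feature dimensions, gating network, and pooling strategy are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AttentiveGatedProjector(nn.Module):
    """Hypothetical sketch: fuse multi-view video features and project them
    into the language model's token-embedding space. Dimensions and gating
    details are assumptions, not the published architecture."""

    def __init__(self, vid_dim=768, llm_dim=2048):
        super().__init__()
        # Scores each view's relevance from its pooled representation.
        self.gate = nn.Sequential(
            nn.Linear(vid_dim, vid_dim // 2),
            nn.GELU(),
            nn.Linear(vid_dim // 2, 1),
        )
        # Maps fused video features to the LLM embedding dimension.
        self.proj = nn.Sequential(
            nn.Linear(vid_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, view_feats):
        # view_feats: (batch, num_views, num_tokens, vid_dim),
        # e.g. one egocentric and several exocentric TimeSformer outputs.
        pooled = view_feats.mean(dim=2)                    # (B, V, D)
        gates = torch.softmax(self.gate(pooled), dim=1)    # (B, V, 1) view weights
        fused = (gates.unsqueeze(2) * view_feats).sum(1)   # (B, T, D) gated fusion
        return self.proj(fused)                            # (B, T, llm_dim) soft prompts


# Example: two views with 8 visual tokens each from a frozen 768-d backbone.
feats = torch.randn(1, 2, 8, 768)
prompts = AttentiveGatedProjector()(feats)
print(prompts.shape)  # torch.Size([1, 8, 2048])
```

In this sketch the projected tokens would be prepended to the text prompt of the feedback-tuned language model, which then generates the skill-level prediction and the expert-style critique; how ProfVLM conditions the decoder in practice may differ.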