ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
September 30, 2025
Authors: Edoardo Bianchi, Jacopo Staiano, Antonio Liotta
cs.AI
Abstract
Existing approaches to skill proficiency estimation often rely on black-box
video classifiers, ignoring multi-view context and lacking explainability. We
present ProfVLM, a compact vision-language model that reformulates this task as
generative reasoning: it jointly predicts skill level and generates expert-like
feedback from egocentric and exocentric videos. Central to our method is an
AttentiveGatedProjector that dynamically fuses multi-view features, projected
from a frozen TimeSformer backbone into a language model tuned for feedback
generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses
state-of-the-art methods while using up to 20x fewer parameters and reducing
training time by up to 60%. Our approach not only achieves superior accuracy
across diverse activities, but also outputs natural language critiques aligned
with performance, offering transparent reasoning. These results highlight
generative vision-language modeling as a powerful new direction for skill
assessment.
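To make the fusion step concrete, below is a minimal sketch of an attentive gated projector of the kind the abstract describes: per-view features from a frozen video backbone are weighted by learned gates and projected into the language model's embedding space. The class name follows the abstract, but the feature dimensions, gating network, and pooling strategy are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AttentiveGatedProjector(nn.Module):
    """Hypothetical sketch: fuse multi-view video features and project them
    into the language model's token-embedding space. Dimensions and gating
    details are assumptions, not the published architecture."""

    def __init__(self, vid_dim=768, llm_dim=2048):
        super().__init__()
        # Scores each view's relevance from its pooled representation.
        self.gate = nn.Sequential(
            nn.Linear(vid_dim, vid_dim // 2),
            nn.GELU(),
            nn.Linear(vid_dim // 2, 1),
        )
        # Maps fused video features to the LLM embedding dimension.
        self.proj = nn.Sequential(
            nn.Linear(vid_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, view_feats):
        # view_feats: (batch, num_views, num_tokens, vid_dim),
        # e.g. one egocentric and several exocentric TimeSformer outputs.
        pooled = view_feats.mean(dim=2)                    # (B, V, D)
        gates = torch.softmax(self.gate(pooled), dim=1)    # (B, V, 1) view weights
        fused = (gates.unsqueeze(2) * view_feats).sum(1)   # (B, T, D) gated fusion
        return self.proj(fused)                            # (B, T, llm_dim) soft prompts


# Example: two views with 8 visual tokens each from a frozen 768-d backbone.
feats = torch.randn(1, 2, 8, 768)
prompts = AttentiveGatedProjector()(feats)
print(prompts.shape)  # torch.Size([1, 8, 2048])
```

In this sketch the projected tokens would be prepended to the text prompt of the feedback-tuned language model, which then generates the skill-level prediction and the expert-style critique; how ProfVLM conditions the decoder in practice may differ.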