ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
September 30, 2025
Authors: Edoardo Bianchi, Jacopo Staiano, Antonio Liotta
cs.AI
Abstract
Existing approaches to skill proficiency estimation often rely on black-box video classifiers, ignoring multi-view context and lacking explainability. We present ProfVLM, a compact vision-language model that reformulates this task as generative reasoning: it jointly predicts skill level and generates expert-like feedback from egocentric and exocentric videos. Central to our method is an AttentiveGatedProjector that dynamically fuses multi-view features, projected from a frozen TimeSformer backbone into a language model tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60%. Our approach not only achieves superior accuracy across diverse activities, but also outputs natural language critiques aligned with performance, offering transparent reasoning. These results highlight generative vision-language modeling as a powerful new direction for skill assessment.
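
To make the fusion step concrete, below is a minimal PyTorch sketch of an attention-gated projector that weights per-view features (e.g., egocentric and exocentric), gates the fused representation, and projects it into a language model's embedding space. The class name echoes the abstract, but the layer structure, dimensions, and gating form are illustrative assumptions, not the paper's released implementation.

```python
# Illustrative sketch only: module layout and dimensions are assumptions,
# not the authors' code.
import torch
import torch.nn as nn


class AttentiveGatedProjector(nn.Module):
    """Fuses per-view video features and projects them into the LM embedding space."""

    def __init__(self, vid_dim: int = 768, lm_dim: int = 2048):
        super().__init__()
        # Scores each view so fusion can emphasize the more informative viewpoint.
        self.view_attn = nn.Linear(vid_dim, 1)
        # Element-wise gate on the fused feature (sigmoid keeps values in [0, 1]).
        self.gate = nn.Sequential(nn.Linear(vid_dim, vid_dim), nn.Sigmoid())
        # Projects the gated feature into the language model's token embedding space.
        self.proj = nn.Linear(vid_dim, lm_dim)

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (batch, num_views, tokens, vid_dim) from a frozen video backbone.
        attn = torch.softmax(self.view_attn(view_feats), dim=1)  # weights over views
        fused = (attn * view_feats).sum(dim=1)                   # (batch, tokens, vid_dim)
        gated = self.gate(fused) * fused                         # gated fusion
        return self.proj(gated)                                  # (batch, tokens, lm_dim)


# Example: two views (ego + exo), 8 visual tokens each, projected to LM space.
feats = torch.randn(4, 2, 8, 768)
prefix = AttentiveGatedProjector()(feats)  # (4, 8, 2048), prepended to text tokens
```

In a setup like this, the projector's output tokens would be concatenated with the text prompt's embeddings before decoding, so the language model conditions on the fused video evidence when predicting the skill level and generating feedback.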