ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
September 30, 2025
Authors: Edoardo Bianchi, Jacopo Staiano, Antonio Liotta
cs.AI
Abstract
Existing approaches to skill proficiency estimation often rely on black-box video classifiers, ignoring multi-view context and lacking explainability. We present ProfVLM, a compact vision-language model that reformulates this task as generative reasoning: it jointly predicts skill level and generates expert-like feedback from egocentric and exocentric videos. Central to our method is an AttentiveGatedProjector that dynamically fuses multi-view features, projected from a frozen TimeSformer backbone into a language model tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60%. Our approach not only achieves superior accuracy across diverse activities, but also outputs natural language critiques aligned with performance, offering transparent reasoning. These results highlight generative vision-language modeling as a powerful new direction for skill assessment.
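
To make the fusion step concrete, below is a minimal PyTorch sketch of an attention-gated projector that weights per-view features (e.g., egocentric and exocentric), gates the fused representation, and projects it into a language model's embedding space. The class name echoes the abstract, but the layer structure, dimensions, and gating form are illustrative assumptions, not the paper's released implementation.

```python
# Illustrative sketch only: module layout and dimensions are assumptions,
# not the authors' code.
import torch
import torch.nn as nn


class AttentiveGatedProjector(nn.Module):
    """Fuses per-view video features and projects them into the LM embedding space."""

    def __init__(self, vid_dim: int = 768, lm_dim: int = 2048):
        super().__init__()
        # Scores each view so fusion can emphasize the more informative viewpoint.
        self.view_attn = nn.Linear(vid_dim, 1)
        # Element-wise gate on the fused feature (sigmoid keeps values in [0, 1]).
        self.gate = nn.Sequential(nn.Linear(vid_dim, vid_dim), nn.Sigmoid())
        # Projects the gated feature into the language model's token embedding space.
        self.proj = nn.Linear(vid_dim, lm_dim)

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (batch, num_views, tokens, vid_dim) from a frozen video backbone.
        attn = torch.softmax(self.view_attn(view_feats), dim=1)  # weights over views
        fused = (attn * view_feats).sum(dim=1)                   # (batch, tokens, vid_dim)
        gated = self.gate(fused) * fused                         # gated fusion
        return self.proj(gated)                                  # (batch, tokens, lm_dim)


# Example: two views (ego + exo), 8 visual tokens each, projected to LM space.
feats = torch.randn(4, 2, 8, 768)
prefix = AttentiveGatedProjector()(feats)  # (4, 8, 2048), prepended to text tokens
```

In a setup like this, the projector's output tokens would be concatenated with the text prompt's embeddings before decoding, so the language model conditions on the fused video evidence when predicting the skill level and generating feedback.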