Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback
May 5, 2026
Authors: Edoardo Bianchi, Antonio Liotta
cs.AI
Abstract
Estimating how well a person performs an action, rather than which action is performed, is central to coaching, rehabilitation, and talent identification. This task is challenging because proficiency is encoded in subtle differences in timing, balance, body mechanics, and execution, often distributed across multiple views and short temporal events. We discuss three recent contributions to multi-view proficiency estimation on Ego-Exo4D. SkillFormer introduces a parameter-efficient discriminative architecture for selective multi-view fusion; PATS improves temporal sampling by preserving locally dense excerpts of fundamental movements; and ProfVLM reformulates proficiency estimation as conditional language generation, producing both a proficiency label and expert-style feedback through a gated cross-view projector and a compact language backbone. Together, these methods achieve state-of-the-art accuracy on Ego-Exo4D with up to 20x fewer trainable parameters and up to 3x fewer training epochs than video-transformer baselines, while moving from closed-set classification toward interpretable feedback generation. These results highlight a shift toward efficient, multi-view systems that combine selective fusion, proficiency-aware sampling, and actionable generative feedback.
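The selective multi-view fusion described above can be illustrated with a minimal sketch: each camera view (egocentric or exocentric) yields a feature vector, a gate scores every view, and the fused representation is the gate-weighted sum. This is an illustrative toy in pure Python, not the actual SkillFormer or ProfVLM projector; all names, shapes, and weights here are hypothetical.

```python
import math

def softmax(scores):
    # numerically stable softmax over a list of view scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_fusion(view_feats, gate_w):
    """Fuse per-view features by a learned gate (toy version).

    view_feats: list of n_views feature vectors, each of length d
    gate_w:     gate weight vector of length d (stand-in for a learned layer)
    returns:    (fused feature of length d, per-view gate weights)
    """
    # one scalar relevance score per view: dot product with the gate weights
    scores = [sum(f * w for f, w in zip(feat, gate_w)) for feat in view_feats]
    gates = softmax(scores)  # normalized view importance, sums to 1
    d = len(view_feats[0])
    # fused feature: gate-weighted sum across views, per dimension
    fused = [sum(g * feat[i] for g, feat in zip(gates, view_feats))
             for i in range(d)]
    return fused, gates

# Toy example: one egocentric and two exocentric views, 4-dim features.
feats = [[1.0, 0.2, 0.0, 0.0],   # ego view
         [0.0, 1.0, 0.1, 0.0],   # exo view 1
         [0.0, 0.0, 1.0, 0.3]]   # exo view 2
fused, gates = gated_fusion(feats, gate_w=[1.0, 0.0, 0.0, 0.0])
print(fused, gates)
```

In the actual models the gate would be produced by a learned projection over transformer features rather than a fixed weight vector, but the mechanism of down-weighting uninformative views before fusion is the same.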