

Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback

May 5, 2026
Authors: Edoardo Bianchi, Antonio Liotta
cs.AI

Abstract

Estimating how well a person performs an action, rather than which action is performed, is central to coaching, rehabilitation, and talent identification. This task is challenging because proficiency is encoded in subtle differences in timing, balance, body mechanics, and execution, often distributed across multiple views and short temporal events. We discuss three recent contributions to multi-view proficiency estimation on Ego-Exo4D. SkillFormer introduces a parameter-efficient discriminative architecture for selective multi-view fusion; PATS improves temporal sampling by preserving locally dense excerpts of fundamental movements; and ProfVLM reformulates proficiency estimation as conditional language generation, producing both a proficiency label and expert-style feedback through a gated cross-view projector and a compact language backbone. Together, these methods achieve state-of-the-art accuracy on Ego-Exo4D with up to 20x fewer trainable parameters and up to 3x fewer training epochs than video-transformer baselines, while moving from closed-set classification toward interpretable feedback generation. These results highlight a shift toward efficient, multi-view systems that combine selective fusion, proficiency-aware sampling, and actionable generative feedback.
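The abstract describes PATS's sampling only at a high level: preserving "locally dense excerpts of fundamental movements" rather than sampling uniformly over the whole clip. As a hedged illustration of that idea (not the paper's actual algorithm), one could restrict evenly spaced sampling to annotated movement segments; the segment boundaries and frame budget here are assumptions:

```python
def dense_excerpt_sampling(num_frames, segments, frames_per_segment):
    """Sample frames densely inside each movement segment instead of
    uniformly across the whole clip, preserving local temporal detail.

    segments: list of (start, end) frame windows covering the
    fundamental movements (assumed to be available as annotations).
    """
    indices = []
    for start, end in segments:
        length = end - start
        # Evenly spaced picks confined to the segment -> locally dense.
        step = max(length // frames_per_segment, 1)
        picks = list(range(start, end, step))[:frames_per_segment]
        indices.extend(picks)
    return sorted(i for i in indices if i < num_frames)

# e.g. a 300-frame clip with two short "fundamental movement" windows
frames = dense_excerpt_sampling(300, [(40, 60), (200, 230)], 4)
print(frames)  # [40, 45, 50, 55, 200, 207, 214, 221]
```

Compared with uniform sampling over all 300 frames, the same eight-frame budget is spent entirely inside the short windows where proficiency cues live.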
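ProfVLM's gated cross-view projector is likewise only named, not specified. A minimal sketch of the general gating idea, in which a learned scalar gate decides how much of each exocentric view to mix into the egocentric representation, might look like the following; all shapes, names, and the averaging step are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_cross_view_fusion(ego_feat, exo_feats, w_gate, b_gate):
    """Fuse one egocentric feature vector with several exocentric ones.

    For each ego/exo pair, a sigmoid gate in (0, 1) computed from the
    concatenated pair controls how much of that exocentric view is
    added; views judged uninformative are suppressed.
    """
    fused = ego_feat.copy()
    for exo in exo_feats:
        pair = np.concatenate([ego_feat, exo])   # (2d,)
        gate = sigmoid(pair @ w_gate + b_gate)   # scalar gate in (0, 1)
        fused = fused + gate * exo               # selectively admit the view
    return fused / (1 + len(exo_feats))          # keep the scale comparable

rng = np.random.default_rng(0)
d = 8
ego = rng.standard_normal(d)
exos = [rng.standard_normal(d) for _ in range(3)]
w = rng.standard_normal(2 * d) * 0.1
fused = gated_cross_view_fusion(ego, exos, w, 0.0)
print(fused.shape)  # (8,)
```

In ProfVLM the fused representation conditions a compact language backbone, so the projector's job is to hand the language model a single view-selective embedding rather than all views at once.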
PDF · May 8, 2026