SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation
May 13, 2025
作者: Edoardo Bianchi, Antonio Liotta
cs.AI
Abstract
Assessing human skill levels in complex activities is a challenging problem
with applications in sports, rehabilitation, and training. In this work, we
present SkillFormer, a parameter-efficient architecture for unified multi-view
proficiency estimation from egocentric and exocentric videos. Building on the
TimeSformer backbone, SkillFormer introduces a CrossViewFusion module that
fuses view-specific features using multi-head cross-attention, learnable
gating, and adaptive self-calibration. We leverage Low-Rank Adaptation to
fine-tune only a small subset of parameters, significantly reducing training
costs. When evaluated on the EgoExo4D dataset, SkillFormer achieves
state-of-the-art accuracy in multi-view settings while demonstrating remarkable
computational efficiency, using 4.5x fewer parameters and requiring 3.75x fewer
training epochs than prior baselines. It excels in multiple structured tasks,
confirming the value of multi-view integration for fine-grained skill
assessment.
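The fusion step described above can be illustrated with a minimal sketch. The abstract names the CrossViewFusion module and its three ingredients (multi-head cross-attention, learnable gating, adaptive self-calibration), but not its internals; the shapes, the sigmoid gate, and the layer-norm-style calibration below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_fusion(ego, exo, num_heads=4):
    """Toy CrossViewFusion-style step: ego features attend to exo
    features (multi-head cross-attention), a gate blends the two views,
    and the result is self-calibrated. Inputs: (T, D) token features."""
    T, D = ego.shape
    dh = D // num_heads
    # Split into heads: (num_heads, T, dh); ego provides queries,
    # exo provides keys and values (weight matrices omitted for brevity).
    q = ego.reshape(T, num_heads, dh).transpose(1, 0, 2)
    k = exo.reshape(T, num_heads, dh).transpose(1, 0, 2)
    v = k
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))
    fused = (attn @ v).transpose(1, 0, 2).reshape(T, D)
    # Learnable gate (randomly initialized here): convex blend of the
    # ego stream and the cross-attended exo stream.
    Wg = rng.normal(0, 0.02, (2 * D, D))
    g = 1 / (1 + np.exp(-(np.concatenate([ego, fused], axis=-1) @ Wg)))
    out = g * ego + (1 - g) * fused
    # "Adaptive self-calibration" sketched as a layer-norm-style
    # rescaling; gamma/beta would be learnable in a real model.
    mu = out.mean(-1, keepdims=True)
    sd = out.std(-1, keepdims=True) + 1e-6
    gamma, beta = np.ones(D), np.zeros(D)
    return gamma * (out - mu) / sd + beta

ego = rng.normal(size=(8, 64))   # 8 tokens from the egocentric view
exo = rng.normal(size=(8, 64))   # 8 tokens from an exocentric view
print(cross_view_fusion(ego, exo).shape)  # (8, 64)
```

In a real multi-view setup the exocentric tokens would come from several cameras concatenated along the token axis; the attention map then lets each egocentric token pull in whichever external viewpoint is most informative.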
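The parameter-efficiency claim rests on Low-Rank Adaptation (LoRA): the frozen pretrained weight W is augmented with a trainable low-rank update (alpha/r) * A @ B, so only A and B are optimized. The sketch below shows the mechanism and the parameter saving; the dimension 768 and rank 8 are illustrative choices, not the paper's configuration.

```python
import numpy as np

def lora_delta(d_in, d_out, r=8, alpha=16, seed=0):
    """Low-rank update added to a frozen weight: W + (alpha/r) * A @ B.
    Only A and B are trained, so trainable parameters drop from
    d_in*d_out to r*(d_in + d_out)."""
    rng = np.random.default_rng(seed)
    A = rng.normal(0, 0.02, (d_in, r))  # down-projection, small random init
    B = np.zeros((r, d_out))            # zero init: adapted model starts
                                        # identical to the pretrained one
    return (alpha / r) * A @ B

d = 768                                 # toy hidden size
W = np.eye(d)                           # stands in for a frozen weight
W_adapted = W + lora_delta(d, d)
full_params = d * d
lora_params = 8 * (d + d)
print(f"trainable: {lora_params} vs {full_params} "
      f"({full_params / lora_params:.0f}x fewer)")
```

Because B is zero-initialized, the update is exactly zero before training, and the adapted layer reproduces the pretrained backbone; fine-tuning then only has to learn the low-rank correction.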