SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation
May 13, 2025
Authors: Edoardo Bianchi, Antonio Liotta
cs.AI
Abstract
Assessing human skill levels in complex activities is a challenging problem
with applications in sports, rehabilitation, and training. In this work, we
present SkillFormer, a parameter-efficient architecture for unified multi-view
proficiency estimation from egocentric and exocentric videos. Building on the
TimeSformer backbone, SkillFormer introduces a CrossViewFusion module that
fuses view-specific features using multi-head cross-attention, learnable
gating, and adaptive self-calibration. We leverage Low-Rank Adaptation to
fine-tune only a small subset of parameters, significantly reducing training
costs. Evaluated on the EgoExo4D dataset, SkillFormer achieves
state-of-the-art accuracy in multi-view settings while demonstrating remarkable
computational efficiency, using 4.5x fewer parameters and requiring 3.75x fewer
training epochs than prior baselines. It excels in multiple structured tasks,
confirming the value of multi-view integration for fine-grained skill
assessment.
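
The abstract names the fusion mechanism but not its implementation, so below is a minimal PyTorch sketch of what a CrossViewFusion-style block could look like. The tensor layout (batch, tokens, dim), the gating formulation, and the reading of "adaptive self-calibration" as a learned residual rescaling are all illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a CrossViewFusion-style module, based only on the
# abstract: multi-head cross-attention over view-specific features, a
# learnable gate, and an adaptive self-calibration step. All names,
# dimensions, and wiring are assumptions.
import torch
import torch.nn as nn


class CrossViewFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Cross-attention: egocentric tokens attend to exocentric tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable gate deciding how much cross-view context to mix in.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        # "Adaptive self-calibration" modeled here (an assumption) as a
        # learned residual rescaling of the fused features.
        self.calibrate = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, ego: torch.Tensor, exo: torch.Tensor) -> torch.Tensor:
        # ego, exo: (batch, tokens, dim) view-specific backbone features.
        attended, _ = self.cross_attn(query=ego, key=exo, value=exo)
        g = self.gate(torch.cat([ego, attended], dim=-1))
        fused = g * attended + (1.0 - g) * ego
        return fused + self.calibrate(fused)  # residual self-calibration


# Quick shape check with dummy features.
ego = torch.randn(2, 16, 768)
exo = torch.randn(2, 16, 768)
print(CrossViewFusion()(ego, exo).shape)  # torch.Size([2, 16, 768])
```

For the parameter-efficient training the abstract describes, LoRA adapters would typically be injected into the backbone's attention projections (for example via the peft library's LoraConfig and get_peft_model), so that only the low-rank adapter weights and the fusion head are updated while the TimeSformer backbone stays frozen.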