Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback

Abstract

Estimating how well a person performs an action, rather than which action is performed, is central to coaching, rehabilitation, and talent identification. The task is challenging because proficiency shows up in subtle differences in timing, balance, body mechanics, and execution, often spread across multiple camera views and short temporal events. We discuss three recent contributions to multi-view proficiency estimation on Ego-Exo4D. SkillFormer introduces a parameter-efficient discriminative architecture for selective multi-view fusion. PATS improves temporal sampling by preserving locally dense excerpts of fundamental movements. ProfVLM reformulates proficiency estimation as conditional language generation, producing both a proficiency label and expert-style feedback through a gated cross-view projector and a compact language backbone. Together, these methods achieve state-of-the-art accuracy on Ego-Exo4D with up to 20x fewer trainable parameters and up to 3x fewer training epochs than video-transformer baselines, while moving from closed-set classification toward interpretable feedback generation. These results point to a shift toward efficient multi-view systems that combine selective fusion, proficiency-aware sampling, and actionable generative feedback.
