Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback

Abstract

Estimating how well a person performs an action, rather than which action is performed, is central to coaching, rehabilitation, and talent identification. This task is challenging because proficiency is encoded in subtle differences in timing, balance, body mechanics, and execution, often distributed across multiple views and short temporal events. We discuss three recent contributions to multi-view proficiency estimation on Ego-Exo4D. SkillFormer introduces a parameter-efficient discriminative architecture for selective multi-view fusion; PATS improves temporal sampling by preserving locally dense excerpts of fundamental movements; and ProfVLM reformulates proficiency estimation as conditional language generation, producing both a proficiency label and expert-style feedback through a gated cross-view projector and a compact language backbone. Together, these methods achieve state-of-the-art accuracy on Ego-Exo4D with up to 20x fewer trainable parameters and up to 3x fewer training epochs than video-transformer baselines, while moving from closed-set classification toward interpretable feedback generation. These results highlight a shift toward efficient, multi-view systems that combine selective fusion, proficiency-aware sampling, and actionable generative feedback.

