Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback

Abstract

Estimating how well a person performs an action, rather than which action is performed, is central to coaching, rehabilitation, and talent identification. The task is challenging because proficiency is encoded in subtle differences in timing, balance, body mechanics, and execution, and these cues are often distributed across multiple views and short temporal events. We discuss three recent contributions to multi-view proficiency estimation on Ego-Exo4D. SkillFormer introduces a parameter-efficient discriminative architecture for selective multi-view fusion. PATS improves temporal sampling by preserving locally dense excerpts of fundamental movements. ProfVLM reformulates proficiency estimation as conditional language generation, producing both a proficiency label and expert-style feedback through a gated cross-view projector and a compact language backbone. Together, these methods achieve state-of-the-art accuracy on Ego-Exo4D with up to 20x fewer trainable parameters and up to 3x fewer training epochs than video-transformer baselines, while moving from closed-set classification toward interpretable feedback generation. These results highlight a shift toward efficient multi-view systems that combine selective fusion, proficiency-aware sampling, and actionable generative feedback.
