ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation

Abstract

Existing approaches to skill proficiency estimation often rely on black-box video classifiers, ignoring multi-view context and lacking explainability. We present ProfVLM, a compact vision-language model that reformulates this task as generative reasoning: it jointly predicts skill level and generates expert-like feedback from egocentric and exocentric videos. Central to our method is an AttentiveGatedProjector that dynamically fuses multi-view features, projected from a frozen TimeSformer backbone into a language model tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60%. Our approach not only achieves superior accuracy across diverse activities, but also outputs natural language critiques aligned with performance, offering transparent reasoning. These results highlight generative vision-language modeling as a powerful new direction for skill assessment.
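
The abstract does not spell out the internals of the AttentiveGatedProjector, so the following is only a minimal illustrative sketch, in plain Python, of one way attention-weighted, gated fusion of per-view features could work before projection into a language model's embedding space. The function `fuse_views`, the `query` vector, the weight matrices `gate_w` and `proj_w`, and all dimensions are hypothetical choices made for illustration and are not taken from ProfVLM.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def fuse_views(view_feats, query, gate_w, proj_w):
    """Fuse per-view feature vectors into one language-model-sized vector.

    view_feats: one feature vector per camera view (e.g. ego, exo),
                standing in for frozen-backbone outputs (assumption).
    query:      hypothetical learned query used to score each view.
    gate_w:     hypothetical d x d gate weights.
    proj_w:     hypothetical d_lm x d projection weights.
    """
    d = len(query)
    # Scaled dot-product score for each view, then attention weights.
    scores = [sum(q * f for q, f in zip(query, feat)) / math.sqrt(d)
              for feat in view_feats]
    weights = softmax(scores)
    # Attention-weighted sum of the view features.
    pooled = [sum(w * feat[i] for w, feat in zip(weights, view_feats))
              for i in range(d)]
    # Element-wise sigmoid gate on the pooled features.
    gate = [sigmoid(g) for g in matvec(gate_w, pooled)]
    gated = [g * p for g, p in zip(gate, pooled)]
    # Linear projection into the (assumed) language-model embedding size.
    return matvec(proj_w, gated), weights


# Toy usage with random stand-in features: two views, d = 4, d_lm = 6.
random.seed(0)
d, d_lm = 4, 6
ego = [random.gauss(0, 1) for _ in range(d)]
exo = [random.gauss(0, 1) for _ in range(d)]
query = [random.gauss(0, 1) for _ in range(d)]
gate_w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
proj_w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d_lm)]

token, weights = fuse_views([ego, exo], query, gate_w, proj_w)
```

In this toy setup the attention weights sum to one across views, so a view that scores poorly against the query contributes less to the fused vector. A trained model would learn the query, gate, and projection parameters rather than sample them at random.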
