Understanding Co-speech Gestures in-the-wild

Abstract

Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for co-speech gesture understanding in the wild. Specifically, we propose three new tasks and benchmarks to evaluate a model's ability to comprehend gesture-text-speech associations: (i) gesture-based retrieval, (ii) gestured word spotting, and (iii) active speaker detection using gestures. To solve these tasks, we present a new approach that learns a tri-modal representation spanning speech, text, and video gestures. By combining a global phrase contrastive loss with a local gesture-word coupling loss, we show that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs), on all three tasks. Further analysis reveals that the speech and text modalities capture distinct gesture-related signals, underscoring the advantage of learning a shared tri-modal embedding space. The dataset, model, and code are available at: https://www.robots.ox.ac.uk/~vgg/research/jegal
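To make the idea of a global phrase contrastive loss and gesture-based retrieval more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the form of either loss, so the symmetric InfoNCE formulation, the temperature value of 0.07, and the function names (`symmetric_info_nce`, `retrieve`) are all assumptions made for illustration. The local gesture-word coupling loss is not sketched, since its details are not stated here.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def similarity_matrix(a: Sequence[Vector], b: Sequence[Vector]) -> List[List[float]]:
    """Cosine similarity between every vector in a and every vector in b."""
    a_n = [l2_normalize(v) for v in a]
    b_n = [l2_normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b_n] for u in a_n]


def _cross_entropy_diagonal(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(
    gesture: Sequence[Vector],
    phrase: Sequence[Vector],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """Hypothetical global contrastive loss: gesture[i] and phrase[i] are a
    matched pair, and every other pairing in the batch acts as a negative."""
    sims = similarity_matrix(gesture, phrase)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # phrase-to-gesture direction
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(logits_t))


def retrieve(query: Vector, candidates: Sequence[Vector]) -> int:
    """Index of the candidate with the highest cosine similarity to the query."""
    sims = similarity_matrix([query], candidates)[0]
    return max(range(len(sims)), key=sims.__getitem__)


# Toy embeddings: matched pairs point in the same direction.
gestures = [[1.0, 0.0], [0.0, 1.0]]
phrases = [[2.0, 0.0], [0.0, 3.0]]
aligned_loss = symmetric_info_nce(gestures, phrases)
shuffled_loss = symmetric_info_nce(gestures, phrases[::-1])
best_match = retrieve([1.0, 0.1], [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]])
```

In this toy setup the loss is close to zero when matched pairs are aligned and large when the phrases are shuffled, which is the behaviour a contrastive objective relies on. Once embeddings share a space, retrieval reduces to a nearest-neighbour lookup by cosine similarity.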
