Understanding Co-speech Gestures in-the-wild
March 28, 2025
Authors: Sindhu B Hegde, K R Prajwal, Taein Kwon, Andrew Zisserman
cs.AI
Abstract
Co-speech gestures play a vital role in non-verbal communication. In this
paper, we introduce a new framework for co-speech gesture understanding in the
wild. Specifically, we propose three new tasks and benchmarks to evaluate a
model's capability to comprehend gesture-text-speech associations: (i)
gesture-based retrieval, (ii) gestured word spotting, and (iii) active speaker
detection using gestures. We present a new approach that learns a tri-modal
(speech, text, and video-gesture) representation to solve these tasks. By
combining a global phrase contrastive loss with a local gesture-word coupling
loss, we demonstrate that a strong gesture representation can be learned in a
weakly supervised manner from videos in the wild. Our learned representations
outperform previous methods, including large vision-language models (VLMs),
across all three tasks. Further analysis reveals that speech and text
modalities capture distinct gesture-related signals, underscoring the
advantages of learning a shared tri-modal embedding space. The dataset, model,
and code are available at: https://www.robots.ox.ac.uk/~vgg/research/jegal