Understanding Co-speech Gestures in-the-wild

Abstract

Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for co-speech gesture understanding in the wild. Specifically, we propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-text-speech associations: (i) gesture-based retrieval, (ii) gestured word spotting, and (iii) active speaker detection using gestures. To solve these tasks, we present a new approach that learns a tri-modal representation jointly embedding speech, text, and video of gestures. By combining a global phrase contrastive loss with a local gesture-word coupling loss, we demonstrate that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs), across all three tasks. Further analysis reveals that the speech and text modalities capture distinct gesture-related signals, underscoring the advantages of learning a shared tri-modal embedding space. The dataset, model, and code are available at: https://www.robots.ox.ac.uk/~vgg/research/jegal
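To make the idea of pairing a global contrastive term with a local coupling term more concrete, the following is a minimal sketch and not the authors' implementation. The global term is a standard symmetric InfoNCE loss over a batch of clip-level gesture and phrase embeddings. The local term is a hypothetical stand-in (each word is matched to its most similar frame), since the abstract does not specify the actual gesture-word coupling loss. All function names, the temperature of 0.07, and the weight `lam` are assumptions made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine_matrix(a, b):
    """Pairwise cosine similarities between two lists of vectors."""
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def global_contrastive_loss(gesture_emb, phrase_emb, temperature=0.07):
    """Symmetric InfoNCE: the i-th gesture clip should match the i-th phrase."""
    sim = cosine_matrix(gesture_emb, phrase_emb)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

def local_coupling_loss(frame_emb, word_emb):
    """Hypothetical local term (an assumption, not the paper's loss):
    each word is pulled toward its best-matching gesture frame."""
    sim = cosine_matrix(word_emb, frame_emb)
    best_per_word = [max(row) for row in sim]
    return 1.0 - sum(best_per_word) / len(best_per_word)

def total_loss(gesture_emb, phrase_emb, frame_emb, word_emb, lam=1.0):
    """Weighted sum of the global and (hypothetical) local terms."""
    return (global_contrastive_loss(gesture_emb, phrase_emb)
            + lam * local_coupling_loss(frame_emb, word_emb))
```

In this sketch, correctly paired gesture and phrase embeddings yield a global loss near zero, while shuffling the pairing raises it sharply. The local term needs no word-level timing labels, which is one plausible way a weakly supervised setup could be arranged.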
