Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures

Abstract

Learning a shared representation between spoken text and gesture is central to co-speech gesture retrieval, synthesis, and understanding, but it remains challenging for semantically meaningful gestures, whose communicative intent is not captured by motion alone. Direct contrastive alignment between transcripts and continuous motion embeddings often overemphasizes low-level kinematics and misses the symbolic content of semantic gestures. We propose semantic motion anchors: natural-language abstractions of gesture motion that capture both its physical form and its communicative intent. Our method discretizes 3D gestures into body and hand motion primitives, verbalizes them into structured descriptions, and grounds those descriptions in the transcript to provide auxiliary contrastive supervision. On BEAT2, our method improves text-to-gesture R@1 by 8.2% over a direct text-motion baseline and outperforms prior retrieval approaches in both the text-to-gesture and gesture-to-text directions. Beyond aggregate retrieval metrics, semantic motion anchor supervision helps retrieve gestures that are semantically meaningful for the spoken query rather than defaulting to generic motion patterns. In a downstream retrieval-augmented gesture generation study, users significantly preferred gestures retrieved by our approach over those from a retrieval-augmented generation baseline, showing that semantically grounded retrieval yields generated gestures that better convey communicative intent.
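
As a rough illustration of the auxiliary contrastive supervision and the R@1 metric mentioned above, the following NumPy sketch may help. It is not the authors' implementation. The function names, the temperature, the `aux_weight` value, and in particular the choice to align anchor-description embeddings with motion embeddings are assumptions made here for illustration; the abstract does not specify which embedding pairs the auxiliary term connects or how the terms are weighted.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss for a batch in which a[i] and b[i] are matched.

    The temperature value is an illustrative assumption.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature

    def cross_entropy_diag(z):
        # Numerically stable log-softmax over each row; the positive
        # pair for row i sits on the diagonal.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


def anchored_loss(text_emb, motion_emb, anchor_emb, aux_weight=0.5):
    """Main text-motion loss plus a hypothetical auxiliary anchor term.

    Assumption: the auxiliary term aligns motion embeddings with embeddings
    of the verbalized anchor descriptions; aux_weight is a made-up value.
    """
    main = info_nce(text_emb, motion_emb)
    aux = info_nce(motion_emb, anchor_emb)
    return main + aux_weight * aux


def recall_at_k(query_emb, gallery_emb, k=1):
    """Fraction of queries whose matched item (same index) ranks in the top k."""
    sims = l2_normalize(query_emb) @ l2_normalize(gallery_emb).T
    topk = np.argsort(-sims, axis=1)[:, :k]
    targets = np.arange(len(query_emb))[:, None]
    return float(np.mean((topk == targets).any(axis=1)))
```

In this sketch, text-to-gesture R@1 would be `recall_at_k(text_emb, motion_emb, k=1)` and gesture-to-text R@1 would swap the two arguments. Random or toy embeddings are enough to exercise the functions; real use would require trained text, motion, and anchor encoders, which are outside the scope of this example.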