Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures

Abstract

Learning a shared representation between spoken text and gesture is central to co-speech gesture retrieval, synthesis, and understanding, but it remains challenging for semantically meaningful gestures, whose communicative intent is not captured by motion alone. Direct contrastive alignment between transcripts and continuous motion embeddings often overemphasizes low-level kinematics and misses the symbolic content of semantic gestures. We propose semantic motion anchors: natural-language abstractions of gesture motion that capture both its physical form and its communicative intent. Our method discretizes 3D gestures into body-hand motion primitives, verbalizes them into structured descriptions, and grounds them in the transcript to provide auxiliary contrastive supervision. On BEAT2, our method improves text-to-gesture R@1 by 8.2% over a direct text-motion baseline and outperforms prior retrieval approaches in both the text-to-gesture and gesture-to-text retrieval directions. Beyond aggregate retrieval metrics, semantic motion anchor supervision helps retrieve gestures that are semantically relevant to the spoken query, rather than defaulting to generic motion patterns. A downstream retrieval-augmented gesture generation study showed that users significantly preferred gestures retrieved by our approach over those of a retrieval-augmented generation baseline, demonstrating that semantically grounded retrieval yields gestures that better convey communicative intent in downstream generation.
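To make the idea of auxiliary contrastive supervision and the R@1 metric concrete, the following is a minimal sketch, not the authors' implementation. It assumes a symmetric InfoNCE objective, cosine similarity, a temperature of 0.07, and an auxiliary term that aligns motion embeddings with embeddings of the verbalized anchor descriptions. None of these choices is stated in the abstract, and the names `anchored_loss`, `aux_weight`, and `recall_at_k` are hypothetical.

```python
import math


def cosine_sim_matrix(a, b):
    """Pairwise cosine similarity between two lists of equal-length vectors."""
    def norm(v):
        # Fall back to 1.0 so an all-zero vector does not divide by zero.
        return math.sqrt(sum(x * x for x in v)) or 1.0
    return [[sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v)) for v in b]
            for u in a]


def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE on a square similarity matrix.

    Matched pairs are assumed to lie on the diagonal. The temperature
    value is an assumption, not taken from the paper.
    """
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]  # negative log-softmax of the match
        return total / n

    cols = [list(c) for c in zip(*sim)]  # transpose for the reverse direction
    return 0.5 * (one_direction(sim) + one_direction(cols))


def anchored_loss(text, motion, anchor, aux_weight=0.5):
    """Text-motion contrastive loss plus an auxiliary motion-anchor term.

    `anchor` holds embeddings of the verbalized anchor descriptions.
    Which pair the auxiliary term aligns, and `aux_weight`, are assumptions.
    """
    main = info_nce(cosine_sim_matrix(text, motion))
    aux = info_nce(cosine_sim_matrix(motion, anchor))
    return main + aux_weight * aux


def recall_at_k(sim, k=1):
    """Fraction of queries (rows) whose matching item ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(sim)
```

With toy two-dimensional embeddings, correctly paired batches give a lower loss than mismatched ones, and `recall_at_k(cosine_sim_matrix(text, motion), k=1)` reports the share of transcripts whose own gesture is ranked first. In practice the embeddings would come from learned text and motion encoders, which this sketch leaves out.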