Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures

Abstract

Learning a shared representation between spoken text and gesture is central to co-speech gesture retrieval, synthesis, and understanding. It remains challenging for semantically meaningful gestures, whose communicative intent cannot be captured by motion alone. Direct contrastive alignment between transcripts and continuous motion embeddings often overemphasizes low-level kinematics and misses the symbolic content of semantic gestures. We propose semantic motion anchors: natural-language abstractions of gesture motion that capture both physical form and communicative intent. Our method discretizes 3D gestures into body-hand motion primitives, verbalizes them into structured descriptions, and grounds these descriptions in the transcript to provide auxiliary contrastive supervision. On the BEAT2 dataset, our method improves text-to-gesture R@1 by 8.2% over a direct text-motion baseline and outperforms prior retrieval approaches in both the text-to-gesture and gesture-to-text directions. Beyond aggregate retrieval metrics, semantic motion anchor supervision helps retrieve gestures that semantically match the spoken query, rather than defaulting to generic motion patterns. In a downstream retrieval-augmented gesture generation study, users significantly preferred gestures retrieved by our approach over those from a retrieval-augmented generation baseline, indicating that semantically grounded retrieval yields gestures that better convey communicative intent in downstream generation.
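
The idea of auxiliary contrastive supervision and the R@1 metric can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the symmetric InfoNCE form, the temperature, and the `aux_weight` weighting are all assumptions made for illustration, and the embeddings are taken as given arrays rather than produced by any particular encoder.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch in which a[i] and b[i] are a matched pair.

    Assumption: a standard CLIP-style contrastive loss; the paper's exact
    objective is not given in the abstract.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # (N, N) cosine similarities, scaled

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the a->b and b->a directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

def anchored_loss(text_emb, motion_emb, anchor_emb, aux_weight=0.5):
    """Direct text-motion alignment plus a hypothetical auxiliary term that
    aligns motion with embeddings of its semantic motion anchor description."""
    direct = info_nce(text_emb, motion_emb)
    auxiliary = info_nce(anchor_emb, motion_emb)
    return direct + aux_weight * auxiliary

def recall_at_k(query_emb, gallery_emb, k=1):
    """Fraction of queries whose matched item (same row index) is in the top k."""
    sims = l2_normalize(query_emb) @ l2_normalize(gallery_emb).T
    topk = np.argsort(-sims, axis=1)[:, :k]
    targets = np.arange(len(query_emb))[:, None]
    return float(np.mean((topk == targets).any(axis=1)))
```

In this sketch, setting `aux_weight=0` recovers the direct text-motion baseline, and `recall_at_k(text_emb, motion_emb, k=1)` corresponds to text-to-gesture R@1 when row i of each array belongs to the same clip.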