Semantic Motion Anchors: Connecting Motion and Meaning in Co-Speech Gestures

Abstract

Learning a shared representation between spoken text and gesture is central to co-speech gesture retrieval, synthesis, and understanding, but it remains challenging for semantically meaningful gestures whose communicative intent is not captured by motion alone. Direct contrastive alignment between transcripts and continuous motion embeddings often overemphasizes low-level kinematics and misses the symbolic content of semantic gestures. We propose semantic motion anchors, natural-language abstractions of gesture motion that capture both physical form and communicative intent. Our method discretizes 3D gestures into body-hand motion primitives, verbalizes them into structured descriptions, and grounds them in the transcript to provide auxiliary contrastive supervision. On BEAT2, our method improves text-to-gesture R@1 by 8.2% over a direct text-motion baseline and outperforms prior retrieval approaches in both the text-to-gesture and gesture-to-text retrieval directions. Beyond aggregate retrieval metrics, semantic motion anchor supervision helps retrieve gestures that are semantically appropriate for the spoken query rather than defaulting to generic motion patterns. In a downstream retrieval-augmented gesture generation study, users significantly preferred gestures retrieved by our approach over those of a retrieval-augmented generation baseline, indicating that semantically grounded retrieval yields gestures that better convey communicative intent in downstream generation.
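
To make the idea of auxiliary contrastive supervision concrete, the following is a minimal sketch and not the implementation used in this work. It assumes a symmetric InfoNCE objective, assumes that the auxiliary term aligns motion embeddings with embeddings of the verbalized anchors, and uses hypothetical names throughout (`info_nce`, `anchored_loss`, `aux_weight`, `temperature`). It also shows how an R@1-style retrieval metric is commonly computed, assuming that query `i` has exactly one correct match at gallery index `i`.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `a` is the positive for row i of `b`."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)


def anchored_loss(text_emb, motion_emb, anchor_emb, aux_weight=0.5):
    """Main text-motion loss plus an auxiliary motion-anchor loss.

    Assumption: the auxiliary term pairs motion with anchor-text embeddings;
    `aux_weight` is a hypothetical hyperparameter, not a value from the paper.
    """
    main = info_nce(text_emb, motion_emb)
    aux = info_nce(motion_emb, anchor_emb)
    return main + aux_weight * aux


def recall_at_k(query_emb, gallery_emb, k=1):
    """Fraction of queries whose matching gallery item ranks in the top k."""
    sims = l2_normalize(query_emb) @ l2_normalize(gallery_emb).T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.arange(len(query_emb))[:, None]).any(axis=1)
    return float(hits.mean())
```

In this sketch the anchor term acts only as a training-time regularizer on the motion encoder; retrieval at test time would still compare text and motion embeddings directly, for example with `recall_at_k(text_emb, motion_emb, k=1)`.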