

Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics

December 17, 2025
作者: Junjie Chen, Fei Wang, Zhihao Huang, Qing Zhou, Kun Li, Dan Guo, Linfeng Zhang, Xun Yang
cs.AI

Abstract

Human conversation involves continuous exchanges of speech and nonverbal cues, such as head nods, gaze shifts, and facial expressions, that convey attention and emotion. Modeling these bidirectional dynamics in 3D is essential for building expressive avatars and interactive robots. However, existing frameworks often treat talking and listening as independent processes, or rely on non-causal full-sequence modeling, hindering temporal coherence across turns. We present TIMAR (Turn-level Interleaved Masked AutoRegression), a causal framework for 3D conversational head generation that models dialogue as interleaved audio-visual contexts. It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history, while a lightweight diffusion head predicts continuous 3D head dynamics that capture both coordination and expressive variability. Experiments on the DualTalk benchmark show that TIMAR reduces Fréchet Distance and MSE by 15-30% on the test set, and achieves similar gains on out-of-distribution data. The source code will be released in the GitHub repository https://github.com/CoderChen01/towards-seamleass-interaction.
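To illustrate the turn-level causal attention the abstract describes, the sketch below builds a block-causal attention mask in which each token may attend to every token in its own turn and in all earlier turns, but not to future turns. This is a minimal illustration of the general idea, not the authors' implementation; the function name and the use of NumPy are assumptions.

```python
import numpy as np

def turn_level_causal_mask(turn_ids):
    """Build a block-causal attention mask over tokens grouped into turns.

    turn_ids: 1-D sequence of non-decreasing turn indices, one per token.
    Returns a boolean (n, n) matrix where entry (i, j) is True iff
    query token i may attend to key token j, i.e. token j belongs to
    the same turn as token i or to an earlier turn.
    """
    t = np.asarray(turn_ids)
    # Attention is allowed when the key's turn index <= the query's turn index.
    return t[None, :] <= t[:, None]

# Example: four tokens forming two turns of two tokens each.
mask = turn_level_causal_mask([0, 0, 1, 1])
# Tokens in turn 0 see only turn 0; tokens in turn 1 see both turns.
```

Unlike a standard token-level causal mask, tokens within the same turn attend to each other bidirectionally, matching the "interleaved" fusion within each turn while keeping the history causal across turns.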