ARIG: Autoregressive Interactive Head Generation for Real-time Conversations
July 1, 2025
Authors: Ying Guo, Xi Liu, Cheng Zhen, Pengfei Yan, Xiaoming Wei
cs.AI
Abstract
Face-to-face communication, as a common human activity, motivates research on
interactive head generation. A virtual agent can generate motion responses
with both listening and speaking capabilities based on the audio or motion
signals of the other user and itself. However, the previous clip-wise
generation paradigm and explicit listener/speaker generator-switching methods
have limitations in future signal acquisition, contextual behavioral
understanding, and switching smoothness, making real-time, realistic
interaction challenging. In this paper, we propose an autoregressive (AR)
frame-wise framework called ARIG that realizes real-time generation with
better interaction realism. To achieve real-time generation, we model motion
prediction as a non-vector-quantized AR process. Unlike discrete
codebook-index prediction, we represent the motion distribution with a
diffusion procedure, achieving more accurate predictions in continuous space.
To improve interaction realism, we emphasize interactive behavior
understanding (IBU) and detailed conversational state understanding (CSU). In
IBU, based on dual-track dual-modal signals, we summarize short-range
behaviors through bidirectional-integrated learning and perform contextual
understanding over long ranges. In CSU, we use voice activity signals and the
context features from IBU to understand the various states (interruption,
feedback, pause, etc.) that exist in actual conversations; these serve as
conditions for the final progressive motion prediction. Extensive experiments
verify the effectiveness of our model.
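To make the central mechanism concrete, below is a minimal sketch of what
"non-vector-quantized AR prediction with a diffusion-represented motion
distribution" can look like: a causal transformer summarizes past features
frame by frame, and instead of emitting codebook-index logits, a small
diffusion head denoises a continuous motion latent conditioned on the AR
context. This is our illustrative reading of the abstract, not the authors'
code: all module names, dimensions, the plain DDPM head, and the collapsing of
the paper's IBU/CSU conditioning into a single fused feature sequence are
assumptions.

```python
import torch
import torch.nn as nn

class DiffusionHead(nn.Module):
    """Per-frame diffusion head (hypothetical): denoises a continuous motion
    latent conditioned on an AR context vector, replacing codebook-index
    prediction with sampling in continuous space."""
    def __init__(self, motion_dim=128, ctx_dim=512, steps=50):
        super().__init__()
        self.motion_dim, self.steps = motion_dim, steps
        betas = torch.linspace(1e-4, 0.02, steps)   # simple linear schedule
        alphas = 1.0 - betas
        self.register_buffer("betas", betas)
        self.register_buffer("alphas", alphas)
        self.register_buffer("alphas_bar", torch.cumprod(alphas, dim=0))
        self.eps_net = nn.Sequential(               # noise predictor MLP
            nn.Linear(motion_dim + ctx_dim + 1, 512), nn.SiLU(),
            nn.Linear(512, 512), nn.SiLU(),
            nn.Linear(512, motion_dim),
        )

    def eps(self, x, t, ctx):
        # Append a normalized scalar timestep feature to the noisy latent.
        t_feat = t.float().unsqueeze(-1) / self.steps
        return self.eps_net(torch.cat([x, t_feat, ctx], dim=-1))

    def loss(self, x0, ctx):
        # Standard denoising objective: predict the noise mixed into x0.
        t = torch.randint(0, self.steps, (x0.shape[0],), device=x0.device)
        noise = torch.randn_like(x0)
        ab = self.alphas_bar[t].unsqueeze(-1)
        x_noisy = ab.sqrt() * x0 + (1 - ab).sqrt() * noise
        return ((self.eps(x_noisy, t, ctx) - noise) ** 2).mean()

    @torch.no_grad()
    def sample(self, ctx):
        # Ancestral DDPM sampling of one motion frame given the AR context.
        x = torch.randn(ctx.shape[0], self.motion_dim, device=ctx.device)
        for step in reversed(range(self.steps)):
            t = torch.full((ctx.shape[0],), step, device=ctx.device)
            e = self.eps(x, t, ctx)
            a, ab, b = self.alphas[step], self.alphas_bar[step], self.betas[step]
            x = (x - b / (1 - ab).sqrt() * e) / a.sqrt()
            if step > 0:
                x = x + b.sqrt() * torch.randn_like(x)
        return x

class FrameWiseARGenerator(nn.Module):
    """Hypothetical frame-wise AR backbone: a causal transformer over past
    fused features of both participants; the diffusion head then predicts
    the next motion frame, one frame at a time."""
    def __init__(self, feat_dim=512, motion_dim=128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = DiffusionHead(motion_dim, feat_dim)

    @torch.no_grad()
    def next_frame(self, past_feats):
        # past_feats: (B, T, feat_dim) fused audio/motion features up to the
        # current frame (the paper's IBU/CSU fusion is abstracted away here).
        T = past_feats.shape[1]
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                     device=past_feats.device), diagonal=1)
        ctx = self.backbone(past_feats, mask=mask)[:, -1]  # causal context
        return self.head.sample(ctx)  # continuous motion latent, frame t+1
```

Because each frame is sampled as soon as the current context is available, no
future signals or clip boundaries are needed, which is what makes the
frame-wise AR formulation compatible with real-time, streaming conversation.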