

ARIG: Autoregressive Interactive Head Generation for Real-time Conversations

July 1, 2025
Authors: Ying Guo, Xi Liu, Cheng Zhen, Pengfei Yan, Xiaoming Wei
cs.AI

Abstract

Face-to-face communication, as a common human activity, motivates research on interactive head generation. A virtual agent can generate motion responses with both listening and speaking capabilities based on the audio or motion signals of itself and the other user. However, the previous clip-wise generation paradigm and explicit listener/speaker generator-switching methods have limitations in future signal acquisition, contextual behavioral understanding, and switching smoothness, making real-time, realistic generation challenging. In this paper, we propose an autoregressive (AR) frame-wise framework called ARIG to realize real-time generation with better interaction realism. To achieve real-time generation, we model motion prediction as a non-vector-quantized AR process. Unlike discrete codebook-index prediction, we represent the motion distribution with a diffusion procedure, achieving more accurate predictions in continuous space. To improve interaction realism, we emphasize interactive behavior understanding (IBU) and detailed conversational state understanding (CSU). In IBU, based on dual-track dual-modal signals, we summarize short-range behaviors through bidirectional-integrated learning and perform contextual understanding over long ranges. In CSU, we use voice activity signals and the context features from IBU to understand the various states (interruption, feedback, pause, etc.) that occur in actual conversations. These serve as conditions for the final progressive motion prediction. Extensive experiments verify the effectiveness of our model.
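
As a rough illustration of the non-vector-quantized AR idea described above, the sketch below (not the authors' code; all module names, dimensions, and the simple DDPM-style sampler are assumptions) shows a frame-wise prediction step in PyTorch: a context encoder stands in for IBU/CSU by summarizing past dual-track features, and a small diffusion head denoises a Gaussian into the next continuous motion frame rather than predicting a discrete codebook index.

# Minimal sketch (assumptions throughout) of non-vector-quantized autoregressive
# motion prediction with a diffusion head, as described in the abstract.
import torch
import torch.nn as nn

MOTION_DIM, CTX_DIM, STEPS = 64, 256, 10  # hypothetical sizes / denoising steps

class ContextEncoder(nn.Module):
    """Summarizes past dual-track (agent + user) features; a stand-in for IBU/CSU."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(input_size=MOTION_DIM * 2, hidden_size=CTX_DIM,
                          batch_first=True)
    def forward(self, history):            # history: (B, T, 2*MOTION_DIM)
        _, h = self.rnn(history)
        return h[-1]                        # context vector: (B, CTX_DIM)

class DiffusionHead(nn.Module):
    """Predicts the noise in a continuous motion frame, given context and timestep."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(MOTION_DIM + CTX_DIM + 1, 512), nn.SiLU(),
            nn.Linear(512, MOTION_DIM))
    def forward(self, x_t, ctx, t):
        t_emb = t.float().unsqueeze(-1) / STEPS          # crude timestep embedding
        return self.net(torch.cat([x_t, ctx, t_emb], dim=-1))

@torch.no_grad()
def sample_next_frame(head, ctx):
    """Toy DDPM-style reverse process: denoise Gaussian noise into a motion frame."""
    betas = torch.linspace(1e-4, 0.02, STEPS)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(ctx.shape[0], MOTION_DIM)
    for t in reversed(range(STEPS)):
        eps = head(x, ctx, torch.full((ctx.shape[0],), t))
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x                                # continuous frame, no codebook lookup

encoder, head = ContextEncoder(), DiffusionHead()
history = torch.randn(1, 30, MOTION_DIM * 2)    # 30 past frames of dual-track features
ctx = encoder(history)
next_frame = sample_next_frame(head, ctx)       # one frame-wise AR step
print(next_frame.shape)                         # torch.Size([1, 64])

In an actual system the predicted frame would be appended to the history and the step repeated once per frame, which is what makes the process autoregressive and amenable to real-time streaming.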