ARIG: Autoregressive Interactive Head Generation for Real-time Conversations

Abstract

Face-to-face communication is a common human activity, and it motivates research on interactive head generation. A virtual agent can generate motion responses with both listening and speaking capabilities, based on the audio or motion signals of the other user and of itself. However, previous clip-wise generation paradigms and explicit listener/speaker generator-switching methods are limited in future signal acquisition, contextual behavioral understanding, and switching smoothness, which makes real-time, realistic generation difficult. In this paper, we propose ARIG, an autoregressive (AR) frame-wise framework that realizes real-time generation with better interaction realism. To achieve real-time generation, we model motion prediction as a non-vector-quantized AR process. Unlike discrete codebook-index prediction, we represent the motion distribution with a diffusion procedure, which yields more accurate predictions in continuous space. To improve interaction realism, we emphasize interactive behavior understanding (IBU) and detailed conversational state understanding (CSU). In IBU, based on dual-track dual-modal signals, we summarize short-range behaviors through bidirectional-integrated learning and perform contextual understanding over long ranges. In CSU, we use voice activity signals and the context features from IBU to understand the various states (interruption, feedback, pause, etc.) that occur in actual conversations. These serve as conditions for the final progressive motion prediction. Extensive experiments verify the effectiveness of our model.
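
The frame-wise autoregressive idea described above can be illustrated with a minimal sketch. It is not the ARIG implementation: the names `generate_framewise` and `toy_predictor`, the scalar audio features, and the fixed context length are all assumptions made for illustration, and the deterministic predictor merely stands in for the learned, diffusion-based motion model. The sketch only shows the property that matters for real-time use, namely that each output frame depends on past and current signals and on the agent's own previous output, never on future signals.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Tuple

# Hypothetical per-frame input: (other_user_audio, agent_audio) as scalar features.
Signals = Tuple[float, float]
# One context entry pairs the signals at a frame with the agent's previous motion.
ContextEntry = Tuple[Signals, float]


def generate_framewise(
    stream: Iterable[Signals],
    predict_next: Callable[[List[ContextEntry]], float],
    context_len: int = 4,
    initial_motion: float = 0.0,
) -> List[float]:
    """Emit one motion value per incoming frame, using only past and current signals."""
    context: Deque[ContextEntry] = deque(maxlen=context_len)
    motions: List[float] = []
    previous_motion = initial_motion
    for signals in stream:
        # Autoregressive step: the previous output is fed back as an input.
        context.append((signals, previous_motion))
        previous_motion = predict_next(list(context))
        motions.append(previous_motion)
    return motions


def toy_predictor(context: List[ContextEntry]) -> float:
    """Stand-in for a learned model: blend mean audio energy with the last motion."""
    mean_audio = sum(other + own for (other, own), _ in context) / len(context)
    last_motion = context[-1][1]
    return 0.5 * last_motion + 0.5 * mean_audio


# Four frames of made-up audio activity for the other user and the agent.
frames = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
motions = generate_framewise(frames, toy_predictor)
print(motions)
```

Because the loop consumes the stream one frame at a time, changing a later frame cannot alter motions that were already emitted, in contrast to a clip-wise generator that must wait for a whole clip before producing output.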
