X-Streamer: Unified Human World Modeling with Audiovisual Interaction

Abstract

We introduce X-Streamer, an end-to-end multimodal human world modeling framework for building digital human agents capable of infinite interactions across text, speech, and video within a single unified architecture. Starting from a single portrait, X-Streamer enables real-time, open-ended video calls driven by streaming multimodal inputs. At its core is a Thinker-Actor dual-transformer architecture that unifies multimodal understanding and generation, turning a static portrait into persistent and intelligent audiovisual interactions. The Thinker module perceives and reasons over streaming user inputs, and the Actor translates the Thinker's hidden states into synchronized multimodal streams in real time. Concretely, the Thinker leverages a pretrained large language-speech model, while the Actor employs a chunk-wise autoregressive diffusion model that cross-attends to the Thinker's hidden states to produce time-aligned multimodal responses with interleaved discrete text and audio tokens and continuous video latents. To ensure long-horizon stability, we design inter- and intra-chunk attention with time-aligned multimodal positional embeddings for fine-grained cross-modality alignment and context retention, further reinforced by chunk-wise diffusion forcing and global identity referencing. X-Streamer runs in real time on two A100 GPUs, sustaining hours-long consistent video chat experiences from arbitrary portraits and paving the way toward unified world modeling of interactive digital humans.
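
To make the idea of inter- and intra-chunk attention concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a simple block-causal pattern in which a token may attend to every token in its own chunk (intra-chunk) and to every token in earlier chunks (inter-chunk). The abstract does not specify the exact masking rule, so the function name `chunk_causal_mask`, its parameters, and the bidirectional intra-chunk choice are all hypothetical.

```python
def chunk_causal_mask(num_chunks: int, chunk_len: int) -> list[list[bool]]:
    """Build an assumed block-causal mask over num_chunks * chunk_len tokens.

    mask[q][k] is True when query token q is allowed to attend to key token k.
    This is an illustrative assumption, not the masking rule from the paper.
    """
    n = num_chunks * chunk_len
    mask = []
    for q in range(n):
        q_chunk = q // chunk_len
        # Visible keys: the query's own chunk (intra-chunk) and all
        # earlier chunks (inter-chunk). Later chunks stay hidden.
        mask.append([(k // chunk_len) <= q_chunk for k in range(n)])
    return mask


mask = chunk_causal_mask(num_chunks=3, chunk_len=2)
for row in mask:
    # "x" marks a visible key position, "." a masked one.
    print("".join("x" if allowed else "." for allowed in row))
```

Each printed row corresponds to one query token. Under this assumed pattern, tokens in the first chunk see only that chunk, while tokens in the last chunk see the full history, which is one simple way to keep generation autoregressive across chunks.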

