Avatar V: Scaling Video-Reference Avatar Video Generation

Abstract

Generating avatar videos that are not merely visually similar to a target individual but behaviorally recognizable, faithfully reproducing their talking rhythm, gestural tendencies, and expression dynamics, remains an open challenge. Existing methods predominantly condition on single static images, which provide insufficient identity information and cannot capture dynamic motion traits, while standard pixel-level objectives underserve the perceptually critical facial regions that determine avatar fidelity. We present Avatar V, a production-scale framework that addresses these limitations through video-reference-conditioned identity modeling. Rather than compressing identity into fixed-size embeddings, the model conditions directly on the full token sequence of a reference video, learning to reproduce both static identity attributes (facial geometry, skin texture) and dynamic behavioral patterns (talking rhythm, micro-expressions) through attention over the reference context. We introduce Sparse Reference Attention, an asymmetric mechanism achieving linear-complexity conditioning on arbitrarily long references; a motion representation stream enabling closed-loop talking style transfer; and an identity-aware super-resolution refiner inheriting the full reference conditioning. These are supported by a data engine curating 100M+ training clips from 50M raw videos, and a five-stage training pipeline with flow matching pre-training, personality fine-tuning, two-phase distillation (>10x acceleration), and RLHF alignment, deployed across thousands of GPUs. Avatar V generates 1080p videos of unlimited duration, achieving state-of-the-art identity preservation, lip synchronization, and generation quality on our cross-scene benchmark, consistently outperforming leading systems including Seedance 2.0, Kling O3 Pro, Veo 3.1, and OmniHuman 1.5 in both automated metrics and human evaluation.
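
The abstract does not describe how Sparse Reference Attention is implemented, so the following is only a minimal sketch of the one property it states: an asymmetric attention pattern whose cost grows linearly with reference length. It assumes that queries come only from the tokens being generated, while keys and values span both those tokens and the reference tokens, so the reference never attends to anything. The function name `asymmetric_reference_attention`, the identity Q/K/V projections, and the toy sizes are all hypothetical and chosen for illustration; any sparsity pattern the actual model applies within the reference is not modeled here.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def asymmetric_reference_attention(target, reference):
    """Toy asymmetric attention (illustrative assumption, not the paper's code).

    target:    list of N token vectors being generated
    reference: list of M token vectors from the reference video
    Queries come from `target` only; keys/values are target + reference.
    Returns (outputs, n_scores), where n_scores counts query-key dot products.
    """
    d = len(target[0])
    context = target + reference          # keys and values: N + M tokens
    scale = 1.0 / math.sqrt(d)
    outputs, n_scores = [], 0
    for q in target:                      # only N queries, never M
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        n_scores += len(scores)
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)]
        outputs.append(out)
    return outputs, n_scores


def random_tokens(n, d, rng):
    """Hypothetical stand-in for encoded video tokens."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
target = random_tokens(4, 8, rng)         # N = 4 target tokens, dimension 8
for m in (16, 32, 64):                    # growing reference length M
    reference = random_tokens(m, 8, rng)
    outputs, n_scores = asymmetric_reference_attention(target, reference)
    full_cost = (len(target) + m) ** 2    # symmetric self-attention over all tokens
    print(f"M={m}: asymmetric={n_scores} scores, full={full_cost} scores")
```

Under these assumptions the number of score computations is N(N + M), which is linear in the reference length M for a fixed N, whereas symmetric self-attention over the concatenated sequence would cost (N + M)^2.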