Avatar V: Scaling Video-Reference Avatar Video Generation

Abstract

Generating avatar videos that are not merely visually similar to a target individual but behaviorally recognizable, faithfully reproducing their talking rhythm, gestural tendencies, and expression dynamics, remains an open challenge. Existing methods predominantly condition on single static images, which provide insufficient identity information and cannot capture dynamic motion traits, while standard pixel-level objectives underserve the perceptually critical facial regions that determine avatar fidelity. We present Avatar V, a production-scale framework that addresses these limitations through video-reference-conditioned identity modeling. Rather than compressing identity into fixed-size embeddings, the model conditions directly on the full token sequence of a reference video, learning to reproduce both static identity attributes (facial geometry, skin texture) and dynamic behavioral patterns (talking rhythm, micro-expressions) through attention over the reference context. We introduce Sparse Reference Attention, an asymmetric mechanism achieving linear-complexity conditioning on arbitrarily long references; a motion representation stream enabling closed-loop talking style transfer; and an identity-aware super-resolution refiner inheriting the full reference conditioning. These are supported by a data engine curating 100M+ training clips from 50M raw videos, and a five-stage training pipeline with flow matching pre-training, personality fine-tuning, two-phase distillation (>10x acceleration), and RLHF alignment, deployed across thousands of GPUs. Avatar V generates 1080p videos of unlimited duration, achieving state-of-the-art identity preservation, lip synchronization, and generation quality on our cross-scene benchmark, consistently outperforming leading systems including Seedance 2.0, Kling O3 Pro, Veo 3.1, and OmniHuman 1.5 in both automated metrics and human evaluation.
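To make the linear-complexity claim concrete, the following is a minimal sketch of one possible reading of an asymmetric reference attention, not the authors' implementation. It assumes that only the generated tokens issue queries, while keys and values come from both the generated and the reference tokens, so the score matrix has shape (N, N + M) and grows linearly with the reference length M. The function name, the weight matrices, and the token counts are all hypothetical, and the sparsity pattern itself is not specified in the abstract and is not modeled here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_reference_attention(gen, ref, wq, wk, wv):
    """Hypothetical sketch: gen is (N, d) generated tokens, ref is (M, d) reference tokens.

    Assumption: reference tokens are read-only context. They supply keys and
    values but never issue queries, so cost is O(N * (N + M)), linear in M.
    """
    q = gen @ wq                               # queries from generated tokens only
    ctx = np.concatenate([gen, ref], axis=0)   # (N + M, d) shared context
    k = ctx @ wk
    v = ctx @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N, N + M), no (M, M) block is formed
    weights = softmax(scores, axis=-1)
    return weights @ v, weights                # (N, d) output and (N, N + M) weights

rng = np.random.default_rng(0)
d = 16
wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
gen = rng.standard_normal((8, d))              # N = 8 generated tokens
ref_short = rng.standard_normal((100, d))      # M = 100 reference tokens
ref_long = rng.standard_normal((200, d))       # M = 200 reference tokens

out_short, w_short = asymmetric_reference_attention(gen, ref_short, wq, wk, wv)
out_long, w_long = asymmetric_reference_attention(gen, ref_long, wq, wk, wv)
```

Under this assumption, doubling the reference length adds columns to the score matrix but leaves the number of query rows unchanged, which is the sense in which the conditioning cost stays linear in the reference length.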