Lynx: Towards High-Fidelity Personalized Video Generation

Abstract

We present Lynx, a high-fidelity model for personalized video synthesis from a single input image. Built on an open-source Diffusion Transformer (DiT) foundation model, Lynx introduces two lightweight adapters to ensure identity fidelity. The ID-adapter uses a Perceiver Resampler to convert ArcFace-derived facial embeddings into compact identity tokens for conditioning. The Ref-adapter integrates dense VAE features from a frozen reference pathway and injects fine-grained details into every transformer layer through cross-attention. Together, these modules enable robust identity preservation while maintaining temporal coherence and visual realism. Evaluated on a curated benchmark of 40 subjects and 20 unbiased prompts, yielding 800 test cases, Lynx demonstrates superior face resemblance, competitive prompt following, and strong video quality, advancing the state of personalized video generation.
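To make the idea of turning a face embedding into a few identity tokens more concrete, the following is a minimal sketch of a single resampler-style cross-attention step, in which a small set of latent queries attends to input features. It is not the Lynx implementation. The function names, the toy dimensions, the random weights standing in for learned parameters, and the choice to split the embedding into chunks are all assumptions made for illustration. A full Perceiver Resampler would typically also use multiple heads, normalization, feed-forward layers, and several stacked blocks, all omitted here.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    """Random weights standing in for learned parameters (assumption)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def resample(features, latents, w_q, w_k, w_v):
    """One cross-attention step: latent queries attend to input features.

    features: n_in x d_in, latents: n_tokens x d_model.
    Returns n_tokens x d_model tokens (latents plus attended values).
    """
    q = matmul(latents, w_q)   # n_tokens x d_model
    k = matmul(features, w_k)  # n_in x d_model
    v = matmul(features, w_v)  # n_in x d_model
    scale = 1.0 / math.sqrt(len(q[0]))
    # Scaled dot-product scores between every latent query and every input key.
    scores = [[scale * sum(a * b for a, b in zip(qr, kr)) for kr in k] for qr in q]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    attended = matmul(weights, v)               # n_tokens x d_model
    # Residual connection: keep the latent and add what it attended to.
    return [[l + a for l, a in zip(lr, ar)] for lr, ar in zip(latents, attended)]


# Toy sizes, chosen only to keep the example small (assumptions).
D_IN, D_MODEL, N_IN, N_TOKENS = 4, 8, 4, 3
rng = random.Random(0)

# Stand-in for a face embedding, split into N_IN chunks so that the
# latents have more than one input position to attend over (assumption).
embedding = [rng.gauss(0.0, 1.0) for _ in range(N_IN * D_IN)]
features = [embedding[i * D_IN:(i + 1) * D_IN] for i in range(N_IN)]

latents = random_matrix(N_TOKENS, D_MODEL, rng)
w_q = random_matrix(D_MODEL, D_MODEL, rng)
w_k = random_matrix(D_IN, D_MODEL, rng)
w_v = random_matrix(D_IN, D_MODEL, rng)

tokens = resample(features, latents, w_q, w_k, w_v)
print(len(tokens), len(tokens[0]))  # prints "3 8"
```

The number of output tokens is set by the number of latent queries rather than by the input size, which is what makes the resulting conditioning signal compact.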
