Lynx: Towards High-Fidelity Personalized Video Generation

Abstract

We present Lynx, a high-fidelity model for personalized video synthesis from a single input image. Built on an open-source Diffusion Transformer (DiT) foundation model, Lynx introduces two lightweight adapters to ensure identity fidelity. The ID-adapter employs a Perceiver Resampler to convert ArcFace-derived facial embeddings into compact identity tokens for conditioning, while the Ref-adapter integrates dense VAE features from a frozen reference pathway, injecting fine-grained details across all transformer layers through cross-attention. Together, these modules enable robust identity preservation while maintaining temporal coherence and visual realism. Evaluated on a curated benchmark of 40 subjects and 20 unbiased prompts, yielding 800 test cases, Lynx demonstrates superior face resemblance, competitive prompt following, and strong video quality, advancing the state of personalized video generation.
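
To make the ID-adapter idea more concrete, the following is a minimal, illustrative sketch of a Perceiver-style resampling step: a fixed set of latent queries cross-attends over input face tokens and returns a fixed number of identity tokens. It is not the Lynx implementation. The token counts, the dimension, the random values standing in for learned latents and ArcFace-derived tokens, and the omission of learned projections, multiple heads, and feed-forward layers are all assumptions made for brevity, and the function names are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [
                sum(w * v[c] for w, v in zip(weights, values))
                for c in range(len(values[0]))
            ]
        )
    return outputs


def resample(face_tokens, latents):
    """One simplified resampler step.

    Latent queries attend over the input tokens concatenated with the
    latents themselves, and the result is added back as a residual.
    The output always has len(latents) tokens.
    """
    context = face_tokens + latents
    attended = cross_attention(latents, context, context)
    return [
        [l + a for l, a in zip(latent, att)]
        for latent, att in zip(latents, attended)
    ]


def random_tokens(count, dim, rng):
    """Hypothetical stand-in for learned latents or ArcFace-derived tokens."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


# Assumed toy sizes, chosen only for illustration.
DIM = 8
NUM_LATENTS = 3

rng = random.Random(0)
latents = random_tokens(NUM_LATENTS, DIM, rng)
face_tokens = random_tokens(4, DIM, rng)
identity_tokens = resample(face_tokens, latents)
```

The property this sketch is meant to show is that the number of identity tokens is set by the latent queries, not by the number of input tokens, which is what makes the conditioning signal compact.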
