Lynx: Towards High-Fidelity Personalized Video Generation

Abstract

We present Lynx, a high-fidelity model for personalized video synthesis from a single input image. Built on an open-source Diffusion Transformer (DiT) foundation model, Lynx introduces two lightweight adapters to ensure identity fidelity. The ID-adapter uses a Perceiver Resampler to convert ArcFace-derived facial embeddings into compact identity tokens for conditioning. The Ref-adapter integrates dense VAE features from a frozen reference pathway and injects fine-grained details across all transformer layers through cross-attention. Together, these modules enable robust identity preservation while maintaining temporal coherence and visual realism. Evaluated on a curated benchmark of 40 subjects and 20 unbiased prompts, which yields 800 test cases, Lynx demonstrates superior face resemblance, competitive prompt following, and strong video quality, advancing the state of personalized video generation.
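The resampling idea behind the ID-adapter can be illustrated with a toy sketch: a fixed set of latent query vectors cross-attends over a set of input feature vectors, so the number of output tokens depends only on the number of latents. This is a minimal pure-Python illustration under stated assumptions, not the Lynx implementation. It omits learned projections, multiple heads, and feed-forward layers; the function names (`cross_attention`, `resample`), the sizes, and the random vectors standing in for face features and learned latents are all hypothetical.

```python
import math
import random


def softmax(row):
    # Numerically stable softmax over one row of attention scores.
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    # Single-head scaled dot-product attention with no learned projections:
    # every query row attends over all context rows.
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, c)) for c in context]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the context rows.
        outputs.append(
            [sum(wi * c[j] for wi, c in zip(w, context)) for j in range(dim)]
        )
    return outputs, weights


def resample(context, latents):
    # Perceiver-style resampling step: the latents act as queries, so the
    # output always has len(latents) tokens, whatever len(context) is.
    updates, weights = cross_attention(latents, context)
    # Residual connection: each latent is updated by what it read.
    tokens = [
        [l + u for l, u in zip(lat, upd)] for lat, upd in zip(latents, updates)
    ]
    return tokens, weights


# Hypothetical toy sizes; real models use far larger dimensions.
DIM = 8
NUM_CONTEXT = 5   # stand-in for input face-feature vectors
NUM_LATENTS = 3   # stand-in for the compact identity tokens

rng = random.Random(0)
context = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_CONTEXT)]
latents = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LATENTS)]

identity_tokens, attn = resample(context, latents)
print(len(identity_tokens), len(identity_tokens[0]))
```

The same `cross_attention` pattern, with video tokens as queries and reference features as context, is the general mechanism by which an adapter can inject conditioning into a transformer layer; how Lynx wires this into each layer is not specified in the abstract.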

