ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA

Abstract

Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions; and because classical voice-cloning models condition only on a reference recording, a text prompt cannot redirect speaking style or acoustic environment. We propose ID-LoRA (Identity-Driven In-Context LoRA), which jointly generates a subject's appearance and voice in a single model, letting a text prompt, a reference image, and a short audio clip govern both modalities together. ID-LoRA adapts the LTX-2 joint audio-video diffusion backbone via parameter-efficient In-Context LoRA and, to our knowledge, is the first method to personalize visual appearance and voice in a single generative pass. Two challenges arise. Reference and generation tokens share the same positional-encoding space, making them hard to distinguish; we address this with negative temporal positions, placing reference tokens in a disjoint RoPE region while preserving their internal temporal structure. Speaker characteristics also tend to be diluted during denoising; we introduce identity guidance, a classifier-free guidance variant that amplifies speaker-specific features by contrasting predictions with and without the reference signal. In human preference studies, ID-LoRA is preferred over Kling 2.6 Pro by 73% of annotators for voice similarity and 65% for speaking style. On cross-environment settings, speaker similarity improves by 24% over Kling, with the gap widening as conditions diverge. A preliminary user study further suggests that joint generation provides a useful inductive bias for physically grounded sound synthesis. ID-LoRA achieves these results with only ~3K training pairs on a single GPU. Code, models, and data will be released.
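The two mechanisms named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `temporal_positions` and `identity_guidance` are hypothetical, the choice of offsets `-num_ref..-1` is just one way to place reference tokens in a region disjoint from the generation tokens, and the guidance formula is the standard classifier-free guidance extrapolation, which the abstract describes only as a "variant". It is not the authors' implementation.

```python
# Hypothetical sketch; names, offsets, and the guidance formula are
# assumptions, not taken from the ID-LoRA code release.

def temporal_positions(num_ref, num_gen):
    """Return (ref_positions, gen_positions) as integer temporal indices.

    Generation tokens occupy 0..num_gen-1. Reference tokens are shifted to
    -num_ref..-1, so the two sets never share a position, while consecutive
    reference tokens stay one step apart (their internal order is preserved).
    """
    ref = list(range(-num_ref, 0))
    gen = list(range(num_gen))
    return ref, gen


def identity_guidance(pred_with_ref, pred_without_ref, scale):
    """Standard CFG-style extrapolation away from the reference-free prediction.

    With scale == 1 this returns the reference-conditioned prediction;
    scale > 1 amplifies whatever the reference signal contributes.
    """
    return [
        wo + scale * (w - wo)
        for w, wo in zip(pred_with_ref, pred_without_ref)
    ]


if __name__ == "__main__":
    ref, gen = temporal_positions(3, 4)
    print(ref, gen)  # [-3, -2, -1] [0, 1, 2, 3]
    print(identity_guidance([1.0, 2.0], [0.5, 2.0], 2.0))  # [1.5, 2.0]
```

In a real model the positions would feed a rotary position embedding and the predictions would be tensors from the denoiser; plain lists are used here only to keep the sketch dependency-free.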