ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA

Abstract

Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions; and because classical voice-cloning models condition only on a reference recording, a text prompt cannot redirect speaking style or acoustic environment. We propose ID-LoRA (Identity-Driven In-Context LoRA), which jointly generates a subject's appearance and voice in a single model, letting a text prompt, a reference image, and a short audio clip govern both modalities together. ID-LoRA adapts the LTX-2 joint audio-video diffusion backbone via parameter-efficient In-Context LoRA and, to our knowledge, is the first method to personalize visual appearance and voice in a single generative pass. Two challenges arise. Reference and generation tokens share the same positional-encoding space, making them hard to distinguish; we address this with negative temporal positions, placing reference tokens in a disjoint RoPE region while preserving their internal temporal structure. Speaker characteristics also tend to be diluted during denoising; we introduce identity guidance, a classifier-free guidance variant that amplifies speaker-specific features by contrasting predictions with and without the reference signal. In human preference studies, ID-LoRA is preferred over Kling 2.6 Pro by 73% of annotators for voice similarity and 65% for speaking style. In cross-environment settings, speaker similarity improves by 24% over Kling, with the gap widening as conditions diverge. A preliminary user study further suggests that joint generation provides a useful inductive bias for physically grounded sound synthesis. ID-LoRA achieves these results with only ~3K training pairs on a single GPU. Code, models, and data will be released.
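
The two mechanisms named in the abstract, negative temporal positions and identity guidance, can be illustrated with a minimal sketch. This is not the released implementation: the function names, the choice of integer indices -num_ref..-1 for reference frames, the standard RoPE angle formula, and the guidance rule u + s(c - u) are all assumptions made for illustration. The abstract does not specify the exact position range, the guidance formula, or how identity guidance combines with text guidance.

```python
from typing import List, Sequence


def temporal_positions(num_ref: int, num_gen: int) -> List[int]:
    """Temporal indices for the sequence [reference tokens, generated tokens].

    Assumption: generated frames keep the usual indices 0..num_gen-1 and
    reference frames take -num_ref..-1. The two ranges never overlap, and
    consecutive reference frames remain exactly one step apart.
    """
    reference = list(range(-num_ref, 0))
    generated = list(range(num_gen))
    return reference + generated


def rope_angles(position: int, dim: int, base: float = 10000.0) -> List[float]:
    """Standard RoPE rotation angles for one position (dim must be even).

    Negative positions simply yield negative rotation angles, so they are
    valid inputs to the usual rotary embedding.
    """
    return [position * base ** (-2 * i / dim) for i in range(dim // 2)]


def identity_guidance(
    pred_with_ref: Sequence[float],
    pred_without_ref: Sequence[float],
    scale: float,
) -> List[float]:
    """CFG-style extrapolation (assumed form): u + scale * (c - u).

    c is the prediction conditioned on the reference signal, u is the
    prediction without it. A scale above 1 amplifies whatever the reference
    contributes; a scale of 1 returns the conditioned prediction unchanged.
    """
    return [u + scale * (c - u) for c, u in zip(pred_with_ref, pred_without_ref)]


if __name__ == "__main__":
    print(temporal_positions(3, 4))
    print(rope_angles(-2, 4))
    print(identity_guidance([1.0, 2.0], [0.0, 2.0], 3.0))
```

In this sketch the reference clip and the generated clip can never share a temporal index, which is one simple way to realize the "disjoint RoPE region" the abstract describes. The guidance function only changes components where the reference-conditioned and reference-free predictions differ.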
