ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA
March 10, 2026
Authors: Aviad Dahan, Moran Yanuka, Noa Kraicer, Lior Wolf, Raja Giryes
cs.AI
Abstract
Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions; and because classical voice-cloning models condition only on a reference recording, a text prompt cannot redirect speaking style or acoustic environment. We propose ID-LoRA (Identity-Driven In-Context LoRA), which jointly generates a subject's appearance and voice in a single model, letting a text prompt, a reference image, and a short audio clip govern both modalities together. ID-LoRA adapts the LTX-2 joint audio-video diffusion backbone via parameter-efficient In-Context LoRA and, to our knowledge, is the first method to personalize visual appearance and voice in a single generative pass. Two challenges arise. Reference and generation tokens share the same positional-encoding space, making them hard to distinguish; we address this with negative temporal positions, placing reference tokens in a disjoint RoPE region while preserving their internal temporal structure. Speaker characteristics also tend to be diluted during denoising; we introduce identity guidance, a classifier-free guidance variant that amplifies speaker-specific features by contrasting predictions with and without the reference signal. In human preference studies, ID-LoRA is preferred over Kling 2.6 Pro by 73% of annotators for voice similarity and 65% for speaking style. In cross-environment settings, speaker similarity improves by 24% over Kling, with the gap widening as conditions diverge. A preliminary user study further suggests that joint generation provides a useful inductive bias for physically grounded sound synthesis. ID-LoRA achieves these results with only ~3K training pairs on a single GPU. Code, models, and data will be released.
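The two techniques named in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the helper names, the exact guidance weights, and the precise form of the guidance combination are assumptions; the abstract only states that reference tokens receive temporal positions in a RoPE region disjoint from the generated tokens, and that identity guidance contrasts predictions made with and without the reference signal.

```python
import numpy as np

def temporal_positions(n_ref, n_gen):
    """Assign RoPE temporal indices (hypothetical helper).

    Reference tokens get negative time indices -n_ref..-1, a region
    disjoint from the generated tokens' 0..n_gen-1 range, while the
    reference tokens' internal temporal ordering is preserved.
    """
    ref_pos = np.arange(-n_ref, 0)
    gen_pos = np.arange(0, n_gen)
    return ref_pos, gen_pos

def identity_guidance(eps_uncond, eps_text, eps_full, w_text=5.0, w_id=2.0):
    """CFG-style combination (assumed form).

    eps_uncond: denoiser prediction with no conditioning
    eps_text:   prediction with the text prompt but no reference signal
    eps_full:   prediction with text prompt plus reference image/audio
    The identity term contrasts predictions with and without the
    reference, amplifying speaker-specific features.
    """
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_id * (eps_full - eps_text))
```

The negative-position trick keeps a single shared RoPE space while making reference and generation tokens trivially separable by sign; the guidance function reduces to standard classifier-free guidance when `w_id = 1`.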