Stand-In: A Lightweight, Plug-and-Play Identity Control Technique for Video Generation

Abstract

Generating high-fidelity human videos that match a user-specified identity is an important yet challenging problem in generative AI. Existing methods often require training an excessive number of parameters and are poorly compatible with other AIGC tools. In this paper, we propose Stand-In, a lightweight, plug-and-play framework for identity preservation in video generation. Specifically, we introduce a conditional image branch into a pre-trained video generation model. Identity control is achieved through restricted self-attention with conditional position mapping, and it can be learned quickly from only 2,000 pairs. Despite adding and training only about 1% additional parameters, our framework achieves excellent results in both video quality and identity preservation, outperforming other methods that train all parameters. Moreover, our framework can be seamlessly integrated into other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.
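The abstract names restricted self-attention but does not spell out the masking rule. The sketch below illustrates one plausible reading, stated here as an assumption rather than the paper's confirmed design: video tokens may attend to both video and reference-image tokens, while reference-image tokens attend only to themselves, so the identity features are not altered by the video content. The function names, the token ordering (video first, then image), and the toy vectors are all hypothetical. Conditional position mapping is not modeled.

```python
import math

def restricted_attention_mask(n_video, n_image):
    """Return mask[q][k]: True if query token q may attend to key token k.

    Assumed token order: n_video video tokens, then n_image image tokens.
    Assumed rule (not confirmed by the abstract): image queries see only
    image keys; video queries see every key.
    """
    n = n_video + n_image
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            q_is_image = q >= n_video
            k_is_image = k >= n_video
            mask[q][k] = k_is_image or not q_is_image
    return mask

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with a boolean mask."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Disallowed positions get -inf so their softmax weight is exactly 0.
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
            if mask[i][j] else -math.inf
            for j, kj in enumerate(k)
        ]
        m = max(scores)  # subtract the max for numerical stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([
            sum(w[j] / z * v[j][c] for j in range(len(v)))
            for c in range(len(v[0]))
        ])
    return out

# Toy example: two video tokens followed by one reference-image token.
mask = restricted_attention_mask(n_video=2, n_image=1)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
result = masked_attention(tokens, tokens, values, mask)
```

Under this assumed rule, the image token's output depends only on image tokens, while each video token's output mixes in the reference image's values.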