Scaling Zero-Shot Reference-to-Video Generation
December 7, 2025
Authors: Zijian Zhou, Shikun Liu, Haozhe Liu, Haonan Qiu, Zhaochong An, Weiming Ren, Zhiheng Liu, Xiaoke Huang, Kam Woh Ng, Tian Xie, Xiao Han, Yuren Cong, Hang Li, Chuyan Zhu, Aditya Patel, Tao Xiang, Sen He
cs.AI
Abstract
Reference-to-video (R2V) generation aims to synthesize videos that align with a text prompt while preserving the subject identity from reference images. However, current R2V methods are hindered by the reliance on explicit reference image-video-text triplets, whose construction is highly expensive and difficult to scale. We bypass this bottleneck by introducing Saber, a scalable zero-shot framework that requires no explicit R2V data. Trained exclusively on video-text pairs, Saber employs a masked training strategy and a tailored attention-based model design to learn identity-consistent and reference-aware representations. Mask augmentation techniques are further integrated to mitigate copy-paste artifacts common in reference-to-video generation. Moreover, Saber demonstrates remarkable generalization capabilities across a varying number of references and achieves superior performance on the OpenS2V-Eval benchmark compared to methods trained with R2V data.
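To make the zero-shot training recipe more concrete, the following is a minimal, illustrative Python sketch of how a pseudo-reference could be derived from an ordinary video-text pair: a subject mask is applied to a single frame and lightly augmented before the masked crop serves as the reference, while the full video remains the generation target. All function names, shapes, and augmentation choices below are assumptions for illustration only and are not taken from the paper.

```python
# Illustrative sketch (not the paper's code): building a pseudo-reference from
# a video-text pair in a masked-training style, with simple mask augmentation
# intended to discourage copy-paste artifacts. Names/shapes are assumptions.
import numpy as np


def augment_mask(mask: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly shift and coarsely dilate a binary subject mask so the
    pseudo-reference no longer aligns pixel-perfectly with the target frame."""
    h, w = mask.shape
    # Random translation of the mask by a few pixels.
    dy, dx = rng.integers(-4, 5, size=2)
    shifted = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)
    shifted[np.clip(ys + dy, 0, h - 1), np.clip(xs + dx, 0, w - 1)] = 1
    # Coarse dilation: take the max over a 3x3 neighborhood.
    padded = np.pad(shifted, 1)
    dilated = np.zeros_like(shifted)
    for oy in range(3):
        for ox in range(3):
            dilated = np.maximum(dilated, padded[oy:oy + h, ox:ox + w])
    return dilated


def make_pseudo_reference(frame: np.ndarray, subject_mask: np.ndarray,
                          rng: np.random.Generator) -> np.ndarray:
    """Mask one video frame to obtain a reference-like image; the full video
    stays the target, so no explicit R2V triplets are needed."""
    aug = augment_mask(subject_mask, rng)
    return frame * aug[..., None]  # zero out background pixels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.random((128, 128, 3)).astype(np.float32)  # one video frame
    mask = np.zeros((128, 128), dtype=np.float32)
    mask[32:96, 40:88] = 1.0                               # toy subject mask
    ref = make_pseudo_reference(frame, mask, rng)
    print(ref.shape, float(ref.max()))
```

In an actual pipeline the subject mask would come from a segmentation model rather than a hand-drawn box, and the augmented masked frame would be fed to the reference-aware attention branch while the diffusion loss is computed on the full video.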