

Scaling Zero-Shot Reference-to-Video Generation

December 7, 2025
Authors: Zijian Zhou, Shikun Liu, Haozhe Liu, Haonan Qiu, Zhaochong An, Weiming Ren, Zhiheng Liu, Xiaoke Huang, Kam Woh Ng, Tian Xie, Xiao Han, Yuren Cong, Hang Li, Chuyan Zhu, Aditya Patel, Tao Xiang, Sen He
cs.AI

Abstract

Reference-to-video (R2V) generation aims to synthesize videos that align with a text prompt while preserving the subject identity from reference images. However, current R2V methods are hindered by the reliance on explicit reference image-video-text triplets, whose construction is highly expensive and difficult to scale. We bypass this bottleneck by introducing Saber, a scalable zero-shot framework that requires no explicit R2V data. Trained exclusively on video-text pairs, Saber employs a masked training strategy and a tailored attention-based model design to learn identity-consistent and reference-aware representations. Mask augmentation techniques are further integrated to mitigate copy-paste artifacts common in reference-to-video generation. Moreover, Saber demonstrates remarkable generalization capabilities across a varying number of references and achieves superior performance on the OpenS2V-Eval benchmark compared to methods trained with R2V data.
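To make the masked-training idea concrete, below is a minimal sketch of how a pseudo reference image might be derived from an ordinary video-text training clip, with a simple mask augmentation to discourage copy-paste behavior. The function name, the use of PyTorch, and the specific augmentations (dilation plus random translation) are illustrative assumptions, not the paper's implementation; the subject mask is assumed to come from an off-the-shelf segmenter.

```python
import torch
import torch.nn.functional as F

def build_pseudo_reference(video: torch.Tensor, subject_mask: torch.Tensor,
                           max_shift: int = 8, dilate_iters: int = 2) -> torch.Tensor:
    """Illustrative sketch: derive a masked 'reference image' from a training video.

    video:        (T, C, H, W) float tensor in [0, 1]
    subject_mask: (H, W) binary mask of the subject in the sampled frame
                  (assumed to come from an external segmenter).
    Returns a (C, H, W) pseudo reference with the background zeroed out.
    """
    # 1. Sample one frame from the clip to act as the reference source,
    #    so no separate reference image dataset is needed.
    t = torch.randint(0, video.shape[0], (1,)).item()
    frame = video[t]                                   # (C, H, W)

    # 2. Mask augmentation: dilate and randomly translate the mask so the
    #    reference no longer aligns pixel-perfectly with the target video,
    #    discouraging copy-paste shortcuts.
    m = subject_mask[None, None].float()               # (1, 1, H, W)
    kernel = torch.ones(1, 1, 3, 3)
    for _ in range(dilate_iters):
        m = (F.conv2d(m, kernel, padding=1) > 0).float()
    dx, dy = torch.randint(-max_shift, max_shift + 1, (2,)).tolist()
    m = torch.roll(m, shifts=(dy, dx), dims=(2, 3))

    # 3. Keep only the (augmented) subject region as the pseudo reference,
    #    which is then fed to the reference-aware attention layers.
    return frame * m[0]
```

The pseudo reference produced this way would be paired with the original video and caption during training, which is how a zero-shot R2V model can be learned from video-text pairs alone under the assumptions above.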