Looking Backward: Streaming Video-to-Video Translation with Feature Banks
May 24, 2024
Authors: Feng Liang, Akio Kodaira, Chenfeng Xu, Masayoshi Tomizuka, Kurt Keutzer, Diana Marculescu
cs.AI
Abstract
This paper introduces StreamV2V, a diffusion model that achieves real-time
streaming video-to-video (V2V) translation guided by user prompts. Unlike prior V2V
methods, which use batches to process a limited number of frames, we opt to process
frames in a streaming fashion to support unlimited frames. At the heart of StreamV2V lies
a backward-looking principle that relates the present to the past. This is
realized by maintaining a feature bank, which archives information from past
frames. For incoming frames, StreamV2V extends self-attention to include banked
keys and values and directly fuses similar past features into the output. The
feature bank is continually updated by merging stored and new features, making
it compact but informative. StreamV2V stands out for its adaptability and
efficiency, seamlessly integrating with image diffusion models without
fine-tuning. It runs at 20 FPS on a single A100 GPU, 15x, 46x, 108x, and 158x
faster than FlowVid, CoDeF, Rerender, and TokenFlow, respectively. Quantitative
metrics and user studies confirm StreamV2V's exceptional ability to maintain
temporal consistency.
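
To make the mechanism concrete, below is a minimal PyTorch sketch of the backward-looking pipeline the abstract describes: extended self-attention over banked keys and values, direct fusion of similar past features into the output, and a continual bank update. It is an illustration assembled from the abstract alone, not the released implementation; the names (FeatureBank, extended_self_attention, fuse_similar_past, update_bank), the FIFO eviction window, the blend weight alpha, and the cosine nearest-neighbor fusion rule are all assumptions made here.

import torch
import torch.nn.functional as F


class FeatureBank:
    # Archives keys/values from past frames. The list-of-frames layout
    # and the fixed `window` are assumptions for this sketch.
    def __init__(self, window: int = 2):
        self.window = window
        self.keys: list[torch.Tensor] = []    # each (N, d)
        self.values: list[torch.Tensor] = []  # each (N, d)

    def banked(self):
        # Concatenate all archived frames into one (N_bank, d) tensor.
        if not self.keys:
            return None, None
        return torch.cat(self.keys, dim=0), torch.cat(self.values, dim=0)


def extended_self_attention(q, k, v, bank):
    # Extend self-attention over the current frame's tokens so that
    # queries also attend to keys/values banked from past frames.
    k_bank, v_bank = bank.banked()
    if k_bank is not None:
        k = torch.cat([k, k_bank], dim=0)
        v = torch.cat([v, v_bank], dim=0)
    attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v


def fuse_similar_past(out, bank, alpha: float = 0.5):
    # Directly blend each output token with its most similar banked
    # value; this cosine nearest-neighbor rule is an assumption here.
    _, v_bank = bank.banked()
    if v_bank is None:
        return out
    sim = F.normalize(out, dim=-1) @ F.normalize(v_bank, dim=-1).T
    nearest = v_bank[sim.argmax(dim=-1)]  # (N, d)
    return alpha * out + (1.0 - alpha) * nearest


def update_bank(bank, k_new, v_new):
    # Merge new features in and evict the oldest frame beyond the
    # window; simple FIFO stands in for the paper's merging strategy.
    bank.keys.append(k_new)
    bank.values.append(v_new)
    if len(bank.keys) > bank.window:
        bank.keys.pop(0)
        bank.values.pop(0)


# Per-frame loop: attend with the bank, fuse, then update the bank.
bank = FeatureBank(window=2)
N, d = 64, 320
for _ in range(5):  # stand-in for an unbounded frame stream
    q = k = v = torch.randn(N, d)
    out = extended_self_attention(q, k, v, bank)
    out = fuse_similar_past(out, bank)
    update_bank(bank, k, v)

Keeping the bank to a small merged window bounds per-frame compute and memory, which is what allows processing an unbounded stream rather than a fixed batch of frames.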