Looking Backward: Streaming Video-to-Video Translation with Feature Banks
May 24, 2024
Authors: Feng Liang, Akio Kodaira, Chenfeng Xu, Masayoshi Tomizuka, Kurt Keutzer, Diana Marculescu
cs.AI
Abstract
This paper introduces StreamV2V, a diffusion model that achieves real-time
streaming video-to-video (V2V) translation guided by user prompts. Unlike prior V2V
methods, which use batches to process a limited number of frames, we opt to process
frames in a streaming fashion to support unlimited frames. At the heart of StreamV2V lies
a backward-looking principle that relates the present to the past. This is
realized by maintaining a feature bank, which archives information from past
frames. For incoming frames, StreamV2V extends self-attention to include banked
keys and values and directly fuses similar past features into the output. The
feature bank is continually updated by merging stored and new features, making
it compact but informative. StreamV2V stands out for its adaptability and
efficiency, seamlessly integrating with image diffusion models without
fine-tuning. It runs at 20 FPS on a single A100 GPU, 15x, 46x, 108x, and 158x
faster than FlowVid, CoDeF, Rerender, and TokenFlow, respectively. Quantitative
metrics and user studies confirm StreamV2V's exceptional ability to maintain
temporal consistency.
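
To make the mechanism concrete, below is a minimal PyTorch sketch of the backward-looking pipeline the abstract describes: extended self-attention over banked keys and values, direct fusion of similar past features into the output, and a continual bank update. It is an illustration assembled from the abstract alone, not the released implementation; the names (FeatureBank, extended_self_attention, fuse_similar_past, update_bank), the FIFO eviction window, the blend weight alpha, and the cosine nearest-neighbor fusion rule are all assumptions made here.

import torch
import torch.nn.functional as F


class FeatureBank:
    # Archives keys/values from past frames. The list-of-frames layout
    # and the fixed `window` are assumptions for this sketch.
    def __init__(self, window: int = 2):
        self.window = window
        self.keys: list[torch.Tensor] = []    # each (N, d)
        self.values: list[torch.Tensor] = []  # each (N, d)

    def banked(self):
        # Concatenate all archived frames into one (N_bank, d) tensor.
        if not self.keys:
            return None, None
        return torch.cat(self.keys, dim=0), torch.cat(self.values, dim=0)


def extended_self_attention(q, k, v, bank):
    # Extend self-attention over the current frame's tokens so that
    # queries also attend to keys/values banked from past frames.
    k_bank, v_bank = bank.banked()
    if k_bank is not None:
        k = torch.cat([k, k_bank], dim=0)
        v = torch.cat([v, v_bank], dim=0)
    attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v


def fuse_similar_past(out, bank, alpha: float = 0.5):
    # Directly blend each output token with its most similar banked
    # value; this cosine nearest-neighbor rule is an assumption here.
    _, v_bank = bank.banked()
    if v_bank is None:
        return out
    sim = F.normalize(out, dim=-1) @ F.normalize(v_bank, dim=-1).T
    nearest = v_bank[sim.argmax(dim=-1)]  # (N, d)
    return alpha * out + (1.0 - alpha) * nearest


def update_bank(bank, k_new, v_new):
    # Merge new features in and evict the oldest frame beyond the
    # window; simple FIFO stands in for the paper's merging strategy.
    bank.keys.append(k_new)
    bank.values.append(v_new)
    if len(bank.keys) > bank.window:
        bank.keys.pop(0)
        bank.values.pop(0)


# Per-frame loop: attend with the bank, fuse, then update the bank.
bank = FeatureBank(window=2)
N, d = 64, 320
for _ in range(5):  # stand-in for an unbounded frame stream
    q = k = v = torch.randn(N, d)
    out = extended_self_attention(q, k, v, bank)
    out = fuse_similar_past(out, bank)
    update_bank(bank, k, v)

Keeping the bank to a small merged window bounds per-frame compute and memory, which is what allows processing an unbounded stream rather than a fixed batch of frames.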