Looking Backward: Streaming Video-to-Video Translation with Feature Banks
May 24, 2024
作者: Feng Liang, Akio Kodaira, Chenfeng Xu, Masayoshi Tomizuka, Kurt Keutzer, Diana Marculescu
cs.AI
Abstract
This paper introduces StreamV2V, a diffusion model that achieves real-time
streaming video-to-video (V2V) translation with user prompts. Unlike prior V2V
methods using batches to process limited frames, we opt to process frames in a
streaming fashion, to support unlimited frames. At the heart of StreamV2V lies
a backward-looking principle that relates the present to the past. This is
realized by maintaining a feature bank, which archives information from past
frames. For incoming frames, StreamV2V extends self-attention to include banked
keys and values and directly fuses similar past features into the output. The
feature bank is continually updated by merging stored and new features, making
it compact but informative. StreamV2V stands out for its adaptability and
efficiency, seamlessly integrating with image diffusion models without
fine-tuning. It can run 20 FPS on one A100 GPU, being 15x, 46x, 108x, and 158x
faster than FlowVid, CoDeF, Rerender, and TokenFlow, respectively. Quantitative
metrics and user studies confirm StreamV2V's exceptional ability to maintain
temporal consistency.
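The core mechanism described above — extending self-attention over banked keys and values from past frames, then merging new features into the bank — can be sketched as follows. This is a minimal single-head, unbatched illustration: the function names, shapes, and the simple blending merge rule are assumptions for clarity, not the paper's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def extended_self_attention(q, k, v, bank_k, bank_v):
    """Attend over the current frame's keys/values plus banked ones
    from past frames, so the output is conditioned on history."""
    k_ext = np.concatenate([k, bank_k], axis=0)   # (n + m, d)
    v_ext = np.concatenate([v, bank_v], axis=0)   # (n + m, d)
    scores = q @ k_ext.T / np.sqrt(q.shape[-1])   # (n, n + m)
    return softmax(scores, axis=-1) @ v_ext       # (n, d)

def merge_bank(bank_k, bank_v, new_k, new_v, alpha=0.5):
    """Illustrative merge rule: blend stored and new features so the
    bank stays a fixed size while absorbing fresh information.
    (The paper's actual merging strategy may differ.)"""
    return (alpha * bank_k + (1 - alpha) * new_k,
            alpha * bank_v + (1 - alpha) * new_v)

# Toy streaming loop: n tokens per frame, feature dimension d.
rng = np.random.default_rng(0)
n, d = 4, 8
bank_k = rng.standard_normal((n, d))
bank_v = rng.standard_normal((n, d))
for _ in range(3):  # three incoming frames
    q = rng.standard_normal((n, d))
    k = rng.standard_normal((n, d))
    v = rng.standard_normal((n, d))
    out = extended_self_attention(q, k, v, bank_k, bank_v)
    bank_k, bank_v = merge_bank(bank_k, bank_v, k, v)
print(out.shape)  # (4, 8)
```

Because the bank is updated by merging rather than appending, memory stays constant no matter how many frames have streamed past, which is what allows processing unlimited frames.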