Looking Backward: Streaming Video-to-Video Translation with Feature Banks
May 24, 2024
作者: Feng Liang, Akio Kodaira, Chenfeng Xu, Masayoshi Tomizuka, Kurt Keutzer, Diana Marculescu
cs.AI
Abstract
This paper introduces StreamV2V, a diffusion model that achieves real-time
streaming video-to-video (V2V) translation with user prompts. Unlike prior V2V
methods using batches to process limited frames, we opt to process frames in a
streaming fashion, to support unlimited frames. At the heart of StreamV2V lies
a backward-looking principle that relates the present to the past. This is
realized by maintaining a feature bank, which archives information from past
frames. For incoming frames, StreamV2V extends self-attention to include banked
keys and values and directly fuses similar past features into the output. The
feature bank is continually updated by merging stored and new features, making
it compact but informative. StreamV2V stands out for its adaptability and
efficiency, seamlessly integrating with image diffusion models without
fine-tuning. It can run 20 FPS on one A100 GPU, being 15x, 46x, 108x, and 158x
faster than FlowVid, CoDeF, Rerender, and TokenFlow, respectively. Quantitative
metrics and user studies confirm StreamV2V's exceptional ability to maintain
temporal consistency.
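The core mechanism described above — extending self-attention over banked keys and values from past frames, then merging new features into the bank — can be sketched as follows. This is a minimal single-head, unbatched illustration: the function names, shapes, and the simple blending merge rule are assumptions for clarity, not the paper's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def extended_self_attention(q, k, v, bank_k, bank_v):
    """Attend over the current frame's keys/values plus banked ones
    from past frames, so the output is conditioned on history."""
    k_ext = np.concatenate([k, bank_k], axis=0)   # (n + m, d)
    v_ext = np.concatenate([v, bank_v], axis=0)   # (n + m, d)
    scores = q @ k_ext.T / np.sqrt(q.shape[-1])   # (n, n + m)
    return softmax(scores, axis=-1) @ v_ext       # (n, d)

def merge_bank(bank_k, bank_v, new_k, new_v, alpha=0.5):
    """Illustrative merge rule: blend stored and new features so the
    bank stays a fixed size while absorbing fresh information.
    (The paper's actual merging strategy may differ.)"""
    return (alpha * bank_k + (1 - alpha) * new_k,
            alpha * bank_v + (1 - alpha) * new_v)

# Toy streaming loop: n tokens per frame, feature dimension d.
rng = np.random.default_rng(0)
n, d = 4, 8
bank_k = rng.standard_normal((n, d))
bank_v = rng.standard_normal((n, d))
for _ in range(3):  # three incoming frames
    q = rng.standard_normal((n, d))
    k = rng.standard_normal((n, d))
    v = rng.standard_normal((n, d))
    out = extended_self_attention(q, k, v, bank_k, bank_v)
    bank_k, bank_v = merge_bank(bank_k, bank_v, k, v)
print(out.shape)  # (4, 8)
```

Because the bank is updated by merging rather than appending, memory stays constant no matter how many frames have streamed past, which is what allows processing unlimited frames.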