Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models
July 11, 2024
Authors: Zhening Xing, Gereon Fox, Yanhong Zeng, Xingang Pan, Mohamed Elgharib, Christian Theobalt, Kai Chen
cs.AI
Abstract
Large Language Models have shown remarkable efficacy in generating streaming
data such as text and audio, thanks to their temporally uni-directional
attention mechanism, which models correlations between the current token and
previous tokens. However, video streaming remains much less explored, despite a
growing need for live video processing. State-of-the-art video diffusion models
leverage bi-directional temporal attention to model the correlations between
the current frame and all the surrounding (i.e. including future) frames, which
hinders them from processing streaming videos. To address this problem, we
present Live2Diff, the first attempt at designing a video diffusion model with
uni-directional temporal attention, specifically targeting live streaming video
translation. Compared to previous works, our approach ensures temporal
consistency and smoothness by correlating the current frame with its
predecessors and a few initial warmup frames, without any future frames.
Additionally, we use a highly efficient denoising scheme featuring a KV-cache
mechanism and pipelining, to facilitate streaming video translation at
interactive framerates. Extensive experiments demonstrate the effectiveness of
the proposed attention mechanism and pipeline, outperforming previous methods
in terms of temporal smoothness and/or efficiency.
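
To make the uni-directional temporal attention concrete, the following is a minimal sketch, not the authors' implementation: it builds a boolean attention mask in which each frame attends only to a few initial warmup frames and to its recent predecessors, never to future frames. The function name, window size, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def stream_attn_mask(num_frames: int, num_warmup: int, window: int) -> torch.Tensor:
    """Boolean mask where mask[i, j] is True iff query frame i may attend to key frame j."""
    i = torch.arange(num_frames).unsqueeze(1)  # query (current) frame index
    j = torch.arange(num_frames).unsqueeze(0)  # key frame index
    in_window = (j <= i) & (j > i - window)    # current frame plus its most recent predecessors
    is_warmup = j < num_warmup                 # a few initial warmup frames, visible to all later frames
    return in_window | is_warmup               # no frame past warmup attends to any future frame

# Toy usage on random features shaped (batch, heads, frames, head_dim):
q = torch.randn(1, 8, 16, 64)
k, v = torch.randn_like(q), torch.randn_like(q)
mask = stream_attn_mask(num_frames=16, num_warmup=4, window=8)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)  # (1, 8, 16, 64)
```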
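The KV-cache idea can be sketched in a similarly hedged way: because attention is uni-directional, the keys and values of warmup and recent frames can be computed once, cached, and reused, so each incoming frame only projects itself and attends over the cache. The class name, eviction policy, and shapes below are assumptions for illustration, not Live2Diff's actual code.

```python
import torch
import torch.nn.functional as F

class TemporalKVCache:
    """Per-layer cache of temporal keys/values for warmup frames plus a sliding window of recent frames."""

    def __init__(self, num_warmup: int, window: int):
        self.num_warmup = num_warmup
        self.window = window
        self.keys: list[torch.Tensor] = []    # one (heads, 1, head_dim) tensor per cached frame
        self.values: list[torch.Tensor] = []

    def step(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        """q, k, v: (heads, 1, head_dim) projections of the newest frame; returns its attention output."""
        self.keys.append(k)
        self.values.append(v)
        # Evict the oldest non-warmup frame once the sliding window is full,
        # so the cache always holds the warmup frames plus the last `window` frames.
        if len(self.keys) > self.num_warmup + self.window:
            drop = self.num_warmup
            del self.keys[drop], self.values[drop]
        K = torch.cat(self.keys, dim=1)       # (heads, cached_frames, head_dim)
        V = torch.cat(self.values, dim=1)
        # Only cached (warmup + past) frames are attended; future frames do not exist yet.
        return F.scaled_dot_product_attention(q, K, V)

# Toy usage: feed frames one by one, 8 heads with head_dim 64.
cache = TemporalKVCache(num_warmup=4, window=8)
for _ in range(20):
    q = torch.randn(8, 1, 64)
    k, v = torch.randn_like(q), torch.randn_like(q)
    out = cache.step(q, k, v)                 # (8, 1, 64)
```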