Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models
July 11, 2024
Authors: Zhening Xing, Gereon Fox, Yanhong Zeng, Xingang Pan, Mohamed Elgharib, Christian Theobalt, Kai Chen
cs.AI
Abstract
Large Language Models have shown remarkable efficacy in generating streaming
data such as text and audio, thanks to their temporally uni-directional
attention mechanism, which models correlations between the current token and
previous tokens. However, video streaming remains much less explored, despite a
growing need for live video processing. State-of-the-art video diffusion models
leverage bi-directional temporal attention to model the correlations between
the current frame and all the surrounding (i.e. including future) frames, which
hinders them from processing streaming videos. To address this problem, we
present Live2Diff, the first attempt at designing a video diffusion model with
uni-directional temporal attention, specifically targeting live streaming video
translation. Compared to previous works, our approach ensures temporal
consistency and smoothness by correlating the current frame with its
predecessors and a few initial warmup frames, without any future frames.
Additionally, we use a highly efficient denoising scheme featuring a KV-cache
mechanism and pipelining, to facilitate streaming video translation at
interactive framerates. Extensive experiments demonstrate the effectiveness of
the proposed attention mechanism and pipeline, outperforming previous methods
in terms of temporal smoothness and/or efficiency.
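The abstract describes a temporal attention scheme in which each streaming frame attends only to a few initial warmup frames and to a sliding window of its predecessors, with the corresponding keys and values held in a KV-cache. The sketch below illustrates that masking pattern and a toy rolling cache in PyTorch; it is not the authors' implementation, and the names and parameters (`build_streaming_mask`, `RollingKVCache`, `window`) as well as the eviction policy are illustrative assumptions, since the abstract does not specify them.

```python
# Minimal sketch of a uni-directional (streaming) temporal attention mask
# with warmup frames, plus a toy rolling KV-cache. Illustrative only; not
# the Live2Diff implementation.
import torch


def build_streaming_mask(num_frames: int, num_warmup: int, window: int) -> torch.Tensor:
    """Boolean [num_frames, num_frames] mask where mask[i, j] is True iff
    query frame i may attend to key frame j: j is a warmup frame, or j is
    one of the `window` most recent frames up to and including i.
    No frame ever attends to a future frame."""
    i = torch.arange(num_frames).unsqueeze(1)   # [F, 1] query frame index
    j = torch.arange(num_frames).unsqueeze(0)   # [1, F] key frame index
    recent = (j <= i) & (j > i - window)        # sliding causal window (incl. self)
    warmup = j < num_warmup                     # always-visible warmup frames
    return (recent | warmup) & (j <= i)         # never look at future frames


class RollingKVCache:
    """Toy KV-cache: keeps warmup keys/values permanently and a rolling
    window of the most recent streaming frames (assumed eviction policy)."""

    def __init__(self, num_warmup: int, window: int):
        self.num_warmup = num_warmup
        self.window = window
        self.keys, self.values = [], []

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.keys.append(k)
        self.values.append(v)
        # Evict the oldest non-warmup entry once the window is full.
        if len(self.keys) > self.num_warmup + self.window:
            del self.keys[self.num_warmup]
            del self.values[self.num_warmup]

    def kv(self) -> tuple[torch.Tensor, torch.Tensor]:
        return torch.stack(self.keys), torch.stack(self.values)


if __name__ == "__main__":
    mask = build_streaming_mask(num_frames=6, num_warmup=2, window=3)
    print(mask.int())  # each row: warmup columns + up to 3 most recent frames

    cache = RollingKVCache(num_warmup=2, window=3)
    for _ in range(6):
        k = v = torch.randn(4, 8)  # toy per-frame keys/values
        cache.append(k, v)
    keys, values = cache.kv()
    print(keys.shape)  # torch.Size([5, 4, 8]): 2 warmup + 3 most recent frames
```

In this sketch, the fixed cache size is what makes streaming inference feasible: attention cost per new frame stays constant regardless of how long the stream runs, while the retained warmup entries anchor the style and content established at the start of the stream.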