
ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers

January 7, 2026
作者: Mohsen Ghafoorian, Amirhossein Habibian
cs.AI

Abstract

Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling a chunk-wise recurrent reformulation and constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt's hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to ~160 GPU hours while remaining competitive in quality. Our lightweight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation. The project page is available at https://qualcomm-ai-research.github.io/rehyat.
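To make the "chunk-wise recurrent reformulation and constant memory usage" claim concrete, the sketch below shows how non-causal linear attention can be computed by accumulating a small running state over key/value chunks instead of materializing the full attention matrix. This is a minimal NumPy illustration of the general technique, not the paper's implementation: the feature map (ELU+1) and chunk size are assumptions for demonstration, and the real hybrid mechanism additionally mixes in softmax attention.

```python
import numpy as np

def _phi(x):
    # ELU(x) + 1: a common positive feature map for linear attention
    # (an assumption here; ReHyAt's exact feature map is not specified).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_full(Q, K, V):
    """Non-causal linear attention computed in one shot.

    Cost is O(n * d * d_v) rather than the O(n^2) of softmax attention,
    because keys and values are first reduced to a (d, d_v) summary.
    """
    Qf, Kf = _phi(Q), _phi(K)
    S = Kf.T @ V                      # (d, d_v) summary of all key/value pairs
    z = Kf.sum(axis=0)                # (d,) normalizer accumulator
    return (Qf @ S) / (Qf @ z)[:, None]

def linear_attention_chunked(Q, K, V, chunk=16):
    """Chunk-wise recurrent form of the same computation.

    Keys/values are streamed chunk by chunk into a fixed-size state
    (S, z), so peak memory is constant in the sequence length.
    """
    d, dv = Q.shape[1], V.shape[1]
    S = np.zeros((d, dv))             # running key-value state
    z = np.zeros(d)                   # running normalizer state
    for i in range(0, K.shape[0], chunk):
        Kf = _phi(K[i:i + chunk])
        S += Kf.T @ V[i:i + chunk]    # recurrent state update per chunk
        z += Kf.sum(axis=0)
    Qf = _phi(Q)
    return (Qf @ S) / (Qf @ z)[:, None]
```

Because the state update is a plain sum over chunks, the chunked version is exactly equivalent to the one-shot version for bidirectional (non-causal) attention, which is why such a reformulation can trade memory for streaming without changing the output.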