SVG-EAR: Parameter-Free Linear Compensation for Sparse Video Generation via Error-aware Routing
March 9, 2026
Authors: Xuanyi Zhou, Qiuyang Mang, Shuo Yang, Haocheng Xi, Jintao Zhang, Huanzhi Mao, Joseph E. Gonzalez, Kurt Keutzer, Ion Stoica, Alvin Cheung
cs.AI
Abstract
Diffusion Transformers (DiTs) have become a leading backbone for video generation, yet their quadratic attention cost remains a major bottleneck. Sparse attention reduces this cost by computing only a subset of attention blocks. However, prior methods often either drop the remaining blocks, which incurs information loss, or rely on learned predictors to approximate them, introducing training overhead and potential output distribution shift. In this paper, we show that the missing contributions can be recovered without training: after semantic clustering, keys and values within each block exhibit strong similarity and can be well summarized by a small set of cluster centroids. Based on this observation, we introduce SVG-EAR, a parameter-free linear compensation branch that uses these centroids to approximate skipped blocks and recover their contributions. While centroid compensation is accurate for most blocks, it can fail on a small subset. Standard sparsification typically selects blocks by attention scores, which indicate where the model places its attention mass, but not where the approximation error would be largest. SVG-EAR therefore performs error-aware routing: a lightweight probe estimates the compensation error for each block, and we exactly compute the blocks with the highest error-to-cost ratio while compensating for the skipped blocks. We provide theoretical guarantees that relate attention reconstruction error to clustering quality, and empirically show that SVG-EAR improves the quality-efficiency trade-off and increases throughput at the same generation fidelity on video diffusion tasks. Overall, SVG-EAR establishes a clear Pareto frontier over prior approaches, achieving up to 1.77× and 1.93× speedups while maintaining PSNRs of up to 29.759 and 31.043 on Wan2.2 and HunyuanVideo, respectively.
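To make the centroid-compensation idea concrete, the following is a minimal NumPy sketch (not the authors' implementation; the function name, the k-means clustering step, and the size-weighted merge are illustrative assumptions). It approximates a skipped attention block's contribution by clustering its keys into a few centroids, averaging the values within each cluster, and weighting each centroid's softmax logit by its cluster size. It returns an unnormalized numerator and denominator so a caller could merge the result with exactly computed blocks:

```python
import numpy as np

def centroid_compensation(Q, K, V, num_centroids=4, iters=5):
    """Hypothetical sketch of parameter-free centroid compensation.

    Q: (nq, d) queries; K, V: (nk, d) keys/values of a *skipped* block.
    Returns (numerator, denominator) of the softmax-weighted value sum,
    so the caller can merge this block with exactly computed blocks.
    """
    nk, d = K.shape
    # Simple k-means over the keys (stands in for semantic clustering).
    rng = np.random.default_rng(0)
    C = K[rng.choice(nk, num_centroids, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmin(((K[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(num_centroids):
            mask = assign == j
            if mask.any():
                C[j] = K[mask].mean(0)
    # Per-cluster sizes and mean values summarize the block.
    counts = np.array([(assign == j).sum() for j in range(num_centroids)])
    Vc = np.stack([V[assign == j].mean(0) if (assign == j).any()
                   else np.zeros(d) for j in range(num_centroids)])
    # Each centroid stands in for `counts[j]` keys, so its exp-logit
    # is weighted by the cluster size.
    logits = Q @ C.T / np.sqrt(d)
    w = np.exp(logits) * counts          # (nq, num_centroids)
    return w @ Vc, w.sum(-1, keepdims=True)
```

When the keys in a block are tightly clustered (the regime the paper's observation describes), `numerator / denominator` closely matches the exact softmax attention over that block while touching only `num_centroids` keys instead of `nk`.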