
SVG-EAR: Parameter-Free Linear Compensation for Sparse Video Generation via Error-aware Routing

March 9, 2026
作者: Xuanyi Zhou, Qiuyang Mang, Shuo Yang, Haocheng Xi, Jintao Zhang, Huanzhi Mao, Joseph E. Gonzalez, Kurt Keutzer, Ion Stoica, Alvin Cheung
cs.AI

Abstract

Diffusion Transformers (DiTs) have become a leading backbone for video generation, yet their quadratic attention cost remains a major bottleneck. Sparse attention reduces this cost by computing only a subset of attention blocks. However, prior methods often either drop the remaining blocks, which incurs information loss, or rely on learned predictors to approximate them, introducing training overhead and potential output distribution shift. In this paper, we show that the missing contributions can be recovered without training: after semantic clustering, keys and values within each block exhibit strong similarity and can be well summarized by a small set of cluster centroids. Based on this observation, we introduce SVG-EAR, a parameter-free linear compensation branch that uses the centroids to approximate skipped blocks and recover their contributions. While centroid compensation is accurate for most blocks, it can fail on a small subset. Standard sparsification typically selects blocks by attention scores, which indicate where the model places its attention mass, but not where the approximation error would be largest. SVG-EAR therefore performs error-aware routing: a lightweight probe estimates the compensation error for each block, and we compute exactly the blocks with the highest error-to-cost ratio while compensating for the skipped blocks. We provide theoretical guarantees that relate attention reconstruction error to clustering quality, and empirically show that SVG-EAR improves the quality-efficiency trade-off and increases throughput at the same generation fidelity on video diffusion tasks. Overall, SVG-EAR establishes a clear Pareto frontier over prior approaches, achieving up to 1.77× and 1.93× speedups while maintaining PSNRs of up to 29.759 and 31.043 on Wan2.2 and HunyuanVideo, respectively.
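The centroid-compensation idea from the abstract can be illustrated with a minimal sketch (not the authors' implementation; the function name and block layout are hypothetical). Blocks that are computed exactly contribute one softmax logit per key, while each skipped block contributes a single centroid logit whose weight is scaled by the block size, so the softmax normalization still accounts for the skipped keys:

```python
import numpy as np

def centroid_compensated_attention(q, k_blocks, v_blocks, keep):
    """Block-sparse attention for a single query vector q.

    Kept blocks are computed exactly; each skipped block is replaced by
    its key/value centroid, weighted by the number of keys it summarizes.
    Illustrative sketch only, assuming one cluster per block.
    """
    d = q.shape[0]
    logits, vals, mult = [], [], []
    for kb, vb, kept in zip(k_blocks, v_blocks, keep):
        if kept:
            # Exact block: one scaled dot-product logit per key.
            logits.append(kb @ q / np.sqrt(d))
            vals.append(vb)
            mult.append(np.ones(len(kb)))
        else:
            # Skipped block: a single centroid key/value, with
            # multiplicity equal to the block size so the softmax
            # mass of the skipped keys is preserved.
            logits.append(np.array([kb.mean(axis=0) @ q / np.sqrt(d)]))
            vals.append(vb.mean(axis=0, keepdims=True))
            mult.append(np.array([float(len(kb))]))
    logits = np.concatenate(logits)
    vals = np.vstack(vals)
    mult = np.concatenate(mult)
    w = mult * np.exp(logits - logits.max())  # numerically stable softmax
    w /= w.sum()
    return w @ vals
```

When the keys and values inside a skipped block are tightly clustered, the centroid term recovers that block's contribution almost exactly, which is the regime the paper's theoretical bound (reconstruction error bounded by clustering quality) describes; error-aware routing then spends exact computation only on the blocks where this approximation is predicted to break down.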