Lightning Unified Video Editing via In-Context Sparse Attention

Abstract

Video editing has evolved toward In-Context Learning (ICL) paradigms, yet the resulting quadratic attention cost creates a critical computational bottleneck. In this work, we propose In-context Sparse Attention (ISA), the first near-lossless empirical sparse framework tailored for ICL video editing. Our design is grounded in two key insights: first, context tokens exhibit significantly lower saliency than source tokens; second, we theoretically prove and empirically validate that query sharpness correlates with approximation error. Motivated by these findings, ISA implements an efficient pre-selection strategy to prune redundant context, followed by a dynamic query grouping mechanism that routes high-error queries to full attention and low-error queries to a computationally efficient 0-th order Taylor sparse attention. Furthermore, we build LIVEditor, a novel lightning video editing model, via ISA and a proposed video-editing data pipeline that curated a 1.7M high-quality dataset. Extensive experiments demonstrate that LIVEditor achieves an approximately 60% reduction in attention-module latency while surpassing state-of-the-art methods across EditVerseBench, IVE-Bench, and VIE-Bench, delivering near-lossless acceleration without compromising visual fidelity.
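The pipeline described in the abstract (context pre-selection, then routing queries by sharpness) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and every specific below is an assumption made for illustration: the function names, the `keep_ctx` and `full_frac` parameters, the use of the mean attention score as the context saliency measure, the use of the query norm as a sharpness proxy, and the reading of "0-th order Taylor" as expanding the exponent around the mean of the pruned keys.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Exact scaled dot-product attention for a batch of queries."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def routed_attention(q, k_src, v_src, k_ctx, v_ctx, keep_ctx=0.25, full_frac=0.25):
    """Hypothetical sketch: prune context tokens, then route queries by sharpness."""
    d = q.shape[-1]
    scale = np.sqrt(d)

    # Step 1 (assumed pre-selection rule): keep the context tokens with the
    # highest mean score over all queries, drop the rest.
    ctx_scores = (q @ k_ctx.T / scale).mean(axis=0)
    n_keep = max(1, int(keep_ctx * len(k_ctx)))
    keep = np.argsort(ctx_scores)[-n_keep:]
    drop = np.setdiff1d(np.arange(len(k_ctx)), keep)
    k_sel = np.concatenate([k_src, k_ctx[keep]])
    v_sel = np.concatenate([v_src, v_ctx[keep]])

    # Step 2 (assumed sharpness proxy): queries with the largest norm are
    # treated as "sharp" and sent to full attention over all tokens.
    sharp = np.linalg.norm(q, axis=-1)
    n_full = max(1, int(full_frac * len(q)))
    is_full = np.zeros(len(q), dtype=bool)
    is_full[np.argsort(sharp)[-n_full:]] = True

    out = np.empty((len(q), v_src.shape[-1]))
    k_all = np.concatenate([k_src, k_ctx])
    v_all = np.concatenate([v_src, v_ctx])
    out[is_full] = softmax_attention(q[is_full], k_all, v_all)

    # Step 3: remaining queries attend exactly to the selected tokens; the
    # dropped tokens are summarized by a zeroth-order approximation
    # exp(q.k_j) ~ exp(q.k_bar), with k_bar the mean dropped key (assumption).
    ql = q[~is_full]
    if len(drop) > 0:
        s_sel = ql @ k_sel.T / scale
        s_bar = ql @ k_ctx[drop].mean(axis=0) / scale
        m = np.maximum(s_sel.max(axis=-1), s_bar)  # shared stabilizer
        w_sel = np.exp(s_sel - m[:, None])
        w_bar = np.exp(s_bar - m)
        num = w_sel @ v_sel + w_bar[:, None] * v_ctx[drop].sum(axis=0)
        den = w_sel.sum(axis=-1) + len(drop) * w_bar
        out[~is_full] = num / den[:, None]
    else:
        out[~is_full] = softmax_attention(ql, k_sel, v_sel)
    return out, is_full
```

In this sketch the sparse path costs one extra dot product per query for the pruned tokens instead of one per pruned token, and it coincides with exact attention when nothing is pruned or when all pruned keys are identical. How ISA actually measures saliency and sharpness, and how it groups queries, is not specified in the abstract.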

