Lightning Unified Video Editing via In-Context Sparse Attention

Abstract

Video editing has evolved toward In-Context Learning (ICL) paradigms, yet the resulting quadratic attention cost creates a critical computational bottleneck. In this work, we propose In-context Sparse Attention (ISA), the first near-lossless empirical sparse framework tailored for ICL video editing. Our design is grounded in two key insights: first, context tokens exhibit significantly lower saliency than source tokens; second, we theoretically prove and empirically validate that query sharpness correlates with approximation error. Motivated by these findings, ISA applies an efficient pre-selection strategy to prune redundant context, followed by a dynamic query grouping mechanism that routes high-error queries to full attention and low-error queries to a computationally efficient 0-th order Taylor sparse attention. Furthermore, we build LIVEditor, a lightning video editing model based on ISA and a proposed video-editing data pipeline, which we use to curate a 1.7M-scale high-quality dataset. Extensive experiments demonstrate that LIVEditor reduces attention-module latency by approximately 60% while surpassing state-of-the-art methods on EditVerseBench, IVE-Bench, and VIE-Bench, delivering near-lossless acceleration without compromising visual fidelity.
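
The query routing idea can be illustrated with a small NumPy sketch. This is not the ISA implementation. It rests on several assumptions that the abstract does not confirm: sharpness is taken to be the largest softmax weight of each query, sharper queries are assumed to be the high-error ones, and the 0-th order Taylor branch is read as approximating exp(s) by 1, so that a query simply receives the mean of the values. The function names and the `full_fraction` parameter are hypothetical, and the context pre-selection step is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def full_attention(q, k, v):
    # Standard scaled dot-product attention for 2-D q, k, v.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def zeroth_order_attention(q, k, v):
    # Assumed reading of "0-th order Taylor": exp(s) ~ 1, so all keys get
    # uniform weight and every query receives the mean of the values.
    return np.broadcast_to(v.mean(axis=0), (q.shape[0], v.shape[1])).copy()


def routed_attention(q, k, v, full_fraction=0.25):
    # Hypothetical router: the sharpest queries go to full attention,
    # the rest to the cheap approximation. Sharpness is computed from the
    # full score matrix here purely for illustration; a practical method
    # would need a cheaper estimate.
    n = q.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    sharpness = softmax(scores).max(axis=-1)
    n_full = max(1, int(round(full_fraction * n)))
    mask = np.zeros(n, dtype=bool)
    mask[np.argsort(-sharpness)[:n_full]] = True

    out = np.empty((n, v.shape[1]), dtype=float)
    out[mask] = full_attention(q[mask], k, v)
    out[~mask] = zeroth_order_attention(q[~mask], k, v)
    return out, mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(8, 16))
    k = rng.normal(size=(32, 16))
    v = rng.normal(size=(32, 4))
    out, mask = routed_attention(q, k, v, full_fraction=0.25)
    err = np.abs(out - full_attention(q, k, v)).max(axis=-1)
    print("routed to full attention:", mask)
    print("per-query max abs error:", err)
```

Under these assumptions, a perfectly flat query (uniform attention weights) is reproduced exactly by the cheap branch, which is the intuition behind sending only the sharp queries through full attention.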