The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs

Abstract

Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its viability and efficiency-accuracy trade-offs remain unexplored, and systematic scaling studies are lacking. To address this gap, we perform a careful comparison of training-free sparse attention methods at varying model scales, sequence lengths, and sparsity levels on a diverse collection of long-sequence tasks, including novel ones that rely on natural language while remaining controllable and easy to evaluate. Based on our experiments, we report a series of key findings: 1) An isoFLOPS analysis reveals that for very long sequences, larger and highly sparse models are preferable to smaller and dense ones. 2) The level of sparsity attainable while statistically guaranteeing accuracy preservation is higher during decoding than during prefilling, and in decoding it correlates with model size. 3) No single strategy performs best across all tasks and phases; different scenarios call for different units of sparsification or budget adaptivity. Even moderate sparsity levels often result in significant performance degradation on at least one task, highlighting that sparse attention is not a universal solution. 4) We introduce and validate novel scaling laws specifically tailored to sparse attention, providing evidence that our findings are likely to hold beyond our range of experiments. Through these insights, we demonstrate that sparse attention is a key tool for enhancing the ability of Transformer LLMs to process longer sequences, but that it requires careful evaluation of trade-offs in performance-sensitive applications.
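To make the notion of a sparsity level concrete, the following minimal sketch compares dense attention with a top-k variant for a single decoding-step query. It is an illustration under stated assumptions, not one of the methods evaluated in the paper: the function names, the `budget` parameter, and the random inputs are all hypothetical, and selecting keys from the full score vector (as done here) does not by itself save the cost of computing those scores.

```python
import numpy as np

def dense_attention(q, K, V):
    """Softmax attention for one query vector q over all keys."""
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())  # subtract max for stability
    weights /= weights.sum()
    return weights @ V

def topk_sparse_attention(q, K, V, budget):
    """Attend only to the `budget` highest-scoring keys.

    Hypothetical illustration: the selection uses exact scores, so it
    shows the accuracy side of the trade-off, not the compute savings.
    """
    scores = K @ q / np.sqrt(q.shape[0])
    keep = np.argpartition(scores, -budget)[-budget:]  # indices of top keys
    kept = scores[keep]
    weights = np.exp(kept - kept.max())
    weights /= weights.sum()
    return weights @ V[keep]

# Assumed toy sizes: 1024 cached tokens, head dimension 64.
rng = np.random.default_rng(0)
n, d = 1024, 64
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

dense = dense_attention(q, K, V)
for budget in (n, n // 4, n // 16):
    sparse = topk_sparse_attention(q, K, V, budget)
    err = np.linalg.norm(dense - sparse) / np.linalg.norm(dense)
    print(f"keep 1/{n // budget} of keys: relative error {err:.3f}")
```

With `budget` equal to the full sequence length the two functions agree, and shrinking the budget trades approximation error for fewer key-value reads. That is the kind of efficiency-accuracy trade-off the abstract refers to, here shown on random data only.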
