Parallax: Parameterized Local Linear Attention for Language Modeling

Abstract

Large Language Models (LLMs) have become the central paradigm in artificial intelligence, yet the core computational primitive of attention has remained structurally unchanged. Local Linear Attention (LLA) is an attention mechanism derived from nonparametric statistics in the test-time regression framework. In contrast to prior research on efficient attention variants, LLA upgrades the local constant estimate in softmax attention to a local linear estimate, yielding provably superior bias-variance tradeoffs for associative memory. However, LLA has not been scaled in LLM pretraining due to computational and numerical stability concerns. We introduce Parallax, a parameterized Local Linear Attention that is scalable for LLMs. Parallax eliminates the numerical solver in LLA and learns an extra query-like projector that probes the KV covariance. We place Parallax within a family of attention mechanisms connected by the bandwidth, the probe construction, and the affine structure. We propose a hardware-aware algorithm that increases the arithmetic intensity over FlashAttention, shifting attention further into the compute-bound regime. Our prototype decode kernel matches or outperforms FlashAttention 2/3 across diverse batch sizes and context lengths. We pretrain Parallax at 0.6B and 1.7B scales and find consistent perplexity improvements throughout pretraining, with gains that transfer to downstream benchmarks. The advantage persists under both parameter-matched and compute-matched controls, demonstrating a Pareto improvement. We perform careful pretraining ablations and identify a novel phenomenon whereby Muon unlocks the capacity of Parallax. To our knowledge, this is the first empirical demonstration of strong architecture-optimizer codesign for attention mechanisms in the architecture research literature.
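To make the distinction between a local constant and a local linear estimate concrete, the following is a minimal sketch for scalar keys and values. It is a textbook kernel-weighted regression, not the Parallax algorithm or the LLA solver: the function names, the scalar setting, and the optional `ridge` term are assumptions introduced only for illustration. The local constant estimate is the softmax-weighted average of the values, while the local linear estimate fits an intercept and a slope in the offsets `k_i - q` under the same weights and returns the intercept.

```python
import math


def kernel_weights(q, keys):
    """Softmax weights proportional to exp(q * k_i), for scalar keys."""
    scores = [q * k for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def local_constant(q, keys, values):
    """Softmax attention read as a kernel-weighted average of the values."""
    w = kernel_weights(q, keys)
    return sum(wi * vi for wi, vi in zip(w, values))


def local_linear(q, keys, values, ridge=0.0):
    """Weighted least squares fit of v ~ a + b * (k - q); returns a.

    `ridge` (hypothetical, illustration only) penalizes the slope b.
    """
    w = kernel_weights(q, keys)
    d = [k - q for k in keys]
    s0 = sum(w)
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d)) + ridge
    t0 = sum(wi * vi for wi, vi in zip(w, values))
    t1 = sum(wi * di * vi for wi, di, vi in zip(w, d, values))
    det = s0 * s2 - s1 * s1
    if det <= 1e-12:
        # Degenerate fit (e.g. a single distinct key): fall back to the average.
        return t0 / s0
    # Intercept of the 2x2 normal equations, solved by Cramer's rule.
    return (s2 * t0 - s1 * t1) / det


# Values that are exactly linear in the keys: v = 2k + 1.
keys = [0.0, 1.0, 2.0, 3.0]
values = [2.0 * k + 1.0 for k in keys]
q = 3.0  # query at the boundary of the key range

lc = local_constant(q, keys, values)
ll = local_linear(q, keys, values)
print(f"local constant: {lc:.4f}, local linear: {ll:.4f}")
```

Because the values here are exactly linear in the keys, the local linear fit reproduces 2q + 1 = 7 up to rounding for any positive weights, whereas the weighted average is pulled below 7 by the keys that lie on only one side of the query. This boundary bias is the standard motivation for local linear over local constant estimators.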