Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

Abstract

The quadratic computational complexity of standard attention mechanisms is a severe scalability bottleneck for large language models (LLMs) in long-context scenarios. Hybrid attention mechanisms that combine Full Attention (FA) and Sparse Attention (SA) offer a potential solution, but existing methods typically rely on static allocation ratios that cannot accommodate the varying retrieval demands of different tasks. Furthermore, head-level dynamic sparsity often introduces severe computational load imbalance and long-tail synchronization delays, which hinder hardware acceleration during autoregressive decoding. To bridge this gap, we introduce Flux Attention, a context-aware framework that dynamically optimizes attention computation at the layer level. By integrating a lightweight Layer Router into frozen pretrained LLMs, the proposed method adaptively routes each layer to FA or SA based on the input context. This layer-wise routing preserves high-fidelity information retrieval while ensuring contiguous memory access, translating theoretical computational reductions into practical wall-clock speedups. As a parameter-efficient approach, our framework requires only 12 hours of training on 8×A800 GPUs. Extensive experiments across multiple long-context and mathematical reasoning benchmarks demonstrate that Flux Attention achieves a superior trade-off between performance and inference speed compared with baseline models, with speedups of up to 2.8× and 2.0× in the prefill and decode stages, respectively.
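
To make the layer-level routing idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not describe the router architecture or the sparse pattern, so the linear router over a mean-pooled context vector, the 0.5 threshold, the sliding-window form of SA, and every function name below are assumptions chosen for illustration only.

```python
import math


def mean_pool(token_features):
    """Average per-token feature vectors into a single context vector."""
    n = len(token_features)
    dim = len(token_features[0])
    return [sum(tok[d] for tok in token_features) / n for d in range(dim)]


def route_layers(context_vec, router_weights, router_biases, threshold=0.5):
    """Pick 'FA' or 'SA' for each layer from a per-layer linear score.

    Hypothetical router: one weight vector and bias per layer, followed by
    a sigmoid. The source only says the router is lightweight.
    """
    decisions = []
    for weights, bias in zip(router_weights, router_biases):
        logit = sum(w * x for w, x in zip(weights, context_vec)) + bias
        prob_full = 1.0 / (1.0 + math.exp(-logit))
        decisions.append("FA" if prob_full >= threshold else "SA")
    return decisions


def attention_cost(seq_len, mode, window=128):
    """Rough count of query-key pairs scored by one layer.

    Assumes SA is a sliding window of fixed size; the causal mask is ignored.
    """
    if mode == "FA":
        return seq_len * seq_len
    return seq_len * min(window, seq_len)


def relative_cost(decisions, seq_len, window=128):
    """Cost of the routed model as a fraction of the all-FA cost."""
    hybrid = sum(attention_cost(seq_len, mode, window) for mode in decisions)
    full = len(decisions) * attention_cost(seq_len, "FA")
    return hybrid / full


# Toy input: two tokens with two features each, and a four-layer router.
tokens = [[1.0, 0.0], [0.0, 1.0]]
weights = [[4.0, 0.0], [-4.0, 0.0], [0.0, -4.0], [0.0, 4.0]]
biases = [0.0, 0.0, 0.0, 0.0]

decisions = route_layers(mean_pool(tokens), weights, biases)
print(decisions)
print(relative_cost(decisions, seq_len=1024, window=128))
```

Because the decision is made once per layer, every head in a layer runs the same attention mode. Under these assumptions, that uniformity is what avoids the head-level load imbalance the abstract describes, and the toy cost ratio shows how routing some layers to SA lowers the number of scored query-key pairs.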
