Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning

Abstract

In this technical report, we present the Ring-linear model series, specifically Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 has 16B parameters with 957M activated, while Ring-flash-linear-2.0 has 104B parameters with 6.1B activated. Both models adopt a hybrid architecture that integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared with a 32B-parameter dense model, this series reduces inference cost to 1/10; compared with the original Ring series, it reduces cost by more than 50%. Furthermore, through systematic exploration of the ratio between the different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. In addition, our self-developed high-performance FP8 operator library, linghe, improves overall training efficiency by 50%. Because the training and inference engine operators are closely aligned, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining state-of-the-art (SOTA) performance across multiple challenging complex-reasoning benchmarks.
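
To illustrate why a linear-attention layer can reduce long-context I/O and compute relative to softmax attention, here is a minimal NumPy sketch. It rests on assumptions not stated in this report: it uses a generic, unnormalized causal linear attention without decay or gating, and all function names are hypothetical. It is not the Ring-linear kernel and says nothing about the layer ratio the authors selected. The point it shows is that the linear variant carries a fixed-size state across tokens, while the softmax variant builds a score matrix that grows with the square of the sequence length.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Causal softmax attention; materializes a (T, T) score matrix."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def linear_attention_recurrent(q, k, v):
    """Causal linear attention (assumed form: no softmax, no normalization).

    The running state S has shape (d, d_v) regardless of sequence length,
    so per-token decoding cost does not grow with the context.
    """
    T, d = q.shape
    S = np.zeros((d, v.shape[1]))
    out = np.empty_like(v)
    for t in range(T):
        S += np.outer(k[t], v[t])  # accumulate k_t v_t^T into the state
        out[t] = q[t] @ S          # read out with the current query
    return out


def linear_attention_quadratic(q, k, v):
    """Reference form of the same computation via a masked (T, T) matrix."""
    return np.tril(q @ k.T) @ v


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))

# The recurrent and quadratic forms of linear attention agree.
assert np.allclose(linear_attention_recurrent(q, k, v),
                   linear_attention_quadratic(q, k, v))
```

In a hybrid stack of the kind the abstract describes, most layers would use the recurrent form and a smaller number would keep softmax attention; the exact ratio is the subject of the exploration mentioned above and is not assumed here.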
