RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs

Abstract

Efficient deployment of large language models (LLMs) requires extreme quantization, forcing a critical trade-off between low-bit efficiency and performance. Residual binarization enables hardware-friendly, matmul-free inference by stacking binary (±1) layers, but is plagued by pathological feature co-adaptation. We identify a key failure mode, which we term *inter-path adaptation*: during quantization-aware training (QAT), parallel residual binary paths learn redundant features, degrading the error-compensation structure and limiting the expressive capacity of the model. While prior work relies on heuristic workarounds (e.g., path freezing) that constrain the solution space, we propose RaBiT, a novel quantization framework that resolves co-adaptation by algorithmically enforcing a residual hierarchy. Its core mechanism sequentially derives each binary path from a single shared full-precision weight, which ensures that every path corrects the error of the preceding one. This process is stabilized by a robust initialization that prioritizes functional preservation over mere weight approximation. RaBiT redefines the 2-bit accuracy-efficiency frontier: it achieves state-of-the-art performance, rivals even hardware-intensive Vector Quantization (VQ) methods, and delivers a 4.49× inference speed-up over full-precision models on an RTX 4090.
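The idea of deriving each binary path sequentially from one shared full-precision weight can be illustrated with a minimal NumPy sketch of greedy residual binarization. This is an illustration under stated assumptions, not the RaBiT training procedure: the per-row scale (mean absolute residual), the function names `residual_binarize` and `reconstruct`, and the random test matrix are all hypothetical choices, and neither QAT nor the initialization mentioned in the abstract is modeled.

```python
import numpy as np


def residual_binarize(weight, num_paths=2):
    """Greedily split a full-precision matrix into scaled +/-1 paths.

    Assumption: each path uses a per-row scale equal to the mean absolute
    residual, which is the least-squares optimal scale for a fixed sign
    pattern. The abstract does not specify the scaling scheme.
    """
    residual = weight.astype(np.float64).copy()
    paths = []
    for _ in range(num_paths):
        # Sign pattern of the current residual; zeros map to +1 so that
        # every entry is strictly +1 or -1.
        binary = np.where(residual >= 0, 1.0, -1.0)
        scale = np.abs(residual).mean(axis=1, keepdims=True)
        paths.append((scale, binary))
        # The next path only sees what this path failed to capture.
        residual = residual - scale * binary
    return paths


def reconstruct(paths):
    """Sum the scaled binary paths back into a dense approximation."""
    return sum(scale * binary for scale, binary in paths)


rng = np.random.default_rng(0)
weight = rng.standard_normal((4, 8))  # stand-in for a shared full-precision weight
paths = residual_binarize(weight, num_paths=2)

err_one = np.linalg.norm(weight - reconstruct(paths[:1]))
err_two = np.linalg.norm(weight - reconstruct(paths))
print(f"error with 1 path: {err_one:.4f}, with 2 paths: {err_two:.4f}")
```

Because the second path is computed from the residual of the first, it cannot duplicate what the first path already encodes, and the reconstruction error with two paths is never larger than with one. In this sketch the hierarchy holds by construction for a fixed weight; how it is maintained while the weight is updated during training is not shown here.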

