Context Compression via Explicit Information Transmission

Abstract

Long-context inference with Large Language Models (LLMs) is costly because attention scales quadratically with sequence length and key-value caches grow with the context, which motivates context compression. In this work, we study soft context compression, where a long context is condensed into a small set of continuous representations. Existing methods typically repurpose the LLM itself as a trainable compressor, relying on layer-by-layer self-attention to aggregate information iteratively. We argue that this paradigm suffers from two structural limitations: (i) progressive overwriting of representations across layers, and (ii) uncoordinated allocation of compression capacity across tokens. We propose ComprExIT (Context Compression via Explicit Information Transmission), a lightweight framework that recasts soft compression as explicit information transmission over frozen LLM hidden states. This decouples compression from the model's internal self-attention dynamics. ComprExIT performs (i) depth-wise transmission, which selectively transmits multi-layer information into token anchors and thereby mitigates progressive overwriting, and (ii) width-wise transmission, which aggregates anchors into a small number of slots via a globally optimized transmission plan and thereby ensures coordinated allocation of information. Across six question-answering benchmarks, ComprExIT consistently outperforms state-of-the-art context compression methods while introducing only ~1% additional parameters. This demonstrates that explicit and coordinated information transmission enables more effective and robust long-context compression.
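
To make the two transmission steps concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify how layers are gated or how the transmission plan is optimized, so everything here is an assumption for illustration: a per-token softmax over layers stands in for depth-wise transmission, and Sinkhorn normalization stands in for the "globally optimized transmission plan". All function names, shapes, and the random inputs (which replace frozen hidden states and learned scores) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def depthwise_transmit(hidden, layer_logits):
    """Mix multi-layer hidden states into one anchor per token.

    hidden:       (L, T, D) frozen hidden states (assumed given).
    layer_logits: (T, L) hypothetical learned gate scores per token.
    Returns anchors of shape (T, D).
    """
    weights = softmax(layer_logits, axis=-1)  # (T, L), rows sum to 1
    return np.einsum("tl,ltd->td", weights, hidden)

def sinkhorn_plan(scores, n_iters=200):
    """Assumed stand-in for a globally optimized plan.

    scores: (T, K) anchor-to-slot affinities.
    Returns a plan whose rows sum to ~1/T (every token sends equal mass)
    and whose columns sum to 1/K (every slot receives equal mass).
    """
    T, K = scores.shape
    plan = np.exp(scores - scores.max())
    for _ in range(n_iters):
        plan *= (1.0 / T) / plan.sum(axis=1, keepdims=True)  # fix row sums
        plan *= (1.0 / K) / plan.sum(axis=0, keepdims=True)  # fix column sums
    return plan

def widthwise_transmit(anchors, scores):
    """Aggregate T anchors into K slots using the plan."""
    plan = sinkhorn_plan(scores)
    # Rescale each column so every slot is a convex combination of anchors.
    mix = plan / plan.sum(axis=0, keepdims=True)
    return mix.T @ anchors, plan

# Toy usage: random tensors replace real hidden states and learned scores.
rng = np.random.default_rng(0)
L, T, D, K = 4, 12, 8, 3
hidden = rng.normal(size=(L, T, D))
layer_logits = rng.normal(size=(T, L))
scores = rng.normal(size=(T, K))

anchors = depthwise_transmit(hidden, layer_logits)
slots, plan = widthwise_transmit(anchors, scores)
print(slots.shape)  # (3, 8)
```

The point of the sketch is the contrast drawn in the abstract: because the plan is normalized jointly over all tokens and slots, no slot can absorb all of the mass and no token is left unassigned, which is one plausible reading of "coordinated allocation".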