SpeContext: Enabling Efficient Long-context Reasoning with Speculative Context Sparsity in LLMs

November 30, 2025
Authors: Jiaming Xu, Jiayi Pan, Hanzhen Wang, Yongkang Zhou, Jiancai Ye, Yu Wang, Guohao Dai
cs.AI

Abstract

In this paper, we point out that the objective of retrieval algorithms is to align with the LLM, which resembles the objective of knowledge distillation in LLMs. We analyze the similarity in information focus between a distilled language model (DLM) and the original LLM from the perspective of information theory, and thus propose a novel paradigm that leverages a DLM as the retrieval algorithm. Based on this insight, we present SpeContext, an algorithm-system co-design for long-context reasoning. (1) At the algorithm level, SpeContext proposes a lightweight retrieval head based on the head-level attention weights of the DLM, achieving a >90% parameter reduction by pruning redundant parameters. (2) At the system level, SpeContext designs an asynchronous prefetch dataflow via an elastic loading strategy, effectively overlapping KV cache retrieval with LLM computation. (3) At the compilation level, SpeContext constructs a theoretical memory model and implements an adaptive memory management system that accelerates inference by maximizing GPU memory utilization. We deploy and evaluate SpeContext in two resource-constrained environments, cloud and edge. Extensive experiments show that, compared with the Huggingface framework, SpeContext achieves up to a 24.89x throughput improvement in the cloud and a 10.06x speedup at the edge with negligible accuracy loss, pushing the Pareto frontier of accuracy and throughput.
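
As a concrete illustration of points (1) and (2), here is a minimal PyTorch sketch, not the authors' implementation: every identifier, shape, and the chunk granularity below are illustrative assumptions. A small distilled model's head-level attention weights score context chunks, the top-k chunks are selected, and their KV entries are copied to the GPU on a side stream so the transfer can overlap with the LLM's computation.

```python
# Minimal sketch of DLM-based retrieval plus asynchronous prefetch.
# All names and shapes are illustrative assumptions, not the paper's code.
import torch

def dlm_chunk_scores(q_dlm: torch.Tensor, k_dlm_chunks: torch.Tensor) -> torch.Tensor:
    """Score context chunks with the DLM's head-level attention weights.

    q_dlm:        (heads, dim)          current-step query of the small DLM
    k_dlm_chunks: (chunks, heads, dim)  one pooled key per chunk and head
    Returns one relevance score per chunk (max over heads).
    """
    logits = torch.einsum("hd,chd->ch", q_dlm, k_dlm_chunks)  # (chunks, heads)
    return logits.max(dim=-1).values

def prefetch_topk_kv(scores: torch.Tensor, kv_cache_cpu: list, top_k: int = 4):
    """Select the top-k chunks and copy their KV entries to the GPU on a
    side stream (assumes the CPU-side cache tensors are in pinned memory)."""
    idx = torch.topk(scores, min(top_k, scores.numel())).indices.tolist()
    if not torch.cuda.is_available():  # CPU fallback so the sketch runs anywhere
        return idx, [kv_cache_cpu[i] for i in idx], None
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        fetched = [kv_cache_cpu[i].to("cuda", non_blocking=True) for i in idx]
    # In the real dataflow the LLM keeps computing here; the consumer calls
    # torch.cuda.current_stream().wait_stream(stream) only when it needs the KV.
    return idx, fetched, stream

# Toy usage: 8 DLM heads, head dim 64, a 16-chunk KV cache.
q = torch.randn(8, 64)
keys = torch.randn(16, 8, 64)
cache = [torch.randn(2, 8, 128, 64) for _ in range(16)]  # (K/V, heads, tokens, dim)
idx, kv, stream = prefetch_topk_kv(dlm_chunk_scores(q, keys), cache)
print("selected chunks:", idx)
```

By analogy with draft models in speculative decoding, the DLM is cheap enough to run ahead of the LLM, and, because distillation aligns its information focus with the original model, its attention pattern serves as a predictor of which context chunks the LLM will actually attend to.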