Characterizing Prompt Compression Methods for Long Context Inference
July 11, 2024
Authors: Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami
cs.AI
Abstract
Long context inference presents challenges at the system level with increased
compute and memory requirements, as well as from an accuracy perspective in
being able to reason over long contexts. Recently, several methods have been
proposed to compress the prompt to reduce the context length. However, there
has been little work on comparing the different proposed methods across
different tasks through a standardized analysis. This has led to conflicting
results. To address this, here we perform a comprehensive characterization and
evaluation of different prompt compression methods. In particular, we analyze
extractive compression, summarization-based abstractive compression, and token
pruning methods. Surprisingly, we find that extractive compression often
outperforms all the other approaches, and enables up to 10x compression with
minimal accuracy degradation. Interestingly, we also find that despite several
recent claims, token pruning methods often lag behind extractive compression.
We only found marginal improvements on summarization tasks.
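
To make the extractive family of methods concrete, below is a minimal sketch of query-aware extractive compression: the context is split into sentences, each sentence is scored by embedding similarity to the query, and the highest-scoring sentences are kept (in their original order) until a token budget is reached. This is an illustrative assumption of how such a method can work, not the paper's implementation; the encoder name, the `compress_extractive` helper, and the whitespace token count are all stand-ins.

```python
# Illustrative sketch of extractive prompt compression (not the paper's code).
# Sentences are ranked by cosine similarity to the query, and the most
# relevant ones are retained until a token budget is exhausted.
import re
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

def compress_extractive(context: str, query: str, token_budget: int = 512) -> str:
    # Naive sentence splitting; a real system would use a proper segmenter.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    sent_emb = model.encode(sentences, normalize_embeddings=True)
    query_emb = model.encode([query], normalize_embeddings=True)[0]
    scores = sent_emb @ query_emb  # cosine similarity (embeddings are normalized)

    # Greedily keep the most relevant sentences within the budget, then
    # restore original order to preserve discourse coherence.
    kept, used = [], 0
    for idx in np.argsort(-scores):
        n_tokens = len(sentences[idx].split())  # crude whitespace token count
        if used + n_tokens > token_budget:
            continue
        kept.append(int(idx))
        used += n_tokens
    return " ".join(sentences[i] for i in sorted(kept))
```

Setting the budget to roughly one tenth of the original context length corresponds to the 10x compression ratio discussed in the abstract; the accuracy impact of such a budget is exactly what the paper's evaluation characterizes.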