Characterizing Prompt Compression Methods for Long Context Inference

July 11, 2024
Authors: Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami
cs.AI

Abstract

Long context inference presents challenges at the system level, with increased compute and memory requirements, as well as from an accuracy perspective, in being able to reason over long contexts. Recently, several methods have been proposed to compress the prompt in order to reduce the context length. However, there has been little work comparing the different proposed methods across different tasks through a standardized analysis, which has led to conflicting results. To address this, we perform a comprehensive characterization and evaluation of different prompt compression methods. In particular, we analyze extractive compression, summarization-based abstractive compression, and token pruning methods. Surprisingly, we find that extractive compression often outperforms all other approaches and enables up to 10x compression with minimal accuracy degradation. Interestingly, we also find that, despite several recent claims, token pruning methods often lag behind extractive compression. We found only marginal improvements on summarization tasks.
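
As a concrete illustration of the extractive approach the abstract refers to, the sketch below selects whole sentences from the context by relevance to a query until a target compression ratio is met. The `extractive_compress` helper and its lexical-overlap scoring are assumptions made for illustration only; the extractive methods evaluated in the paper score passages with learned models, and this is not their implementation.

```python
# A minimal sketch of extractive prompt compression, assuming a simple
# lexical-overlap relevance score. Real extractive compressors typically
# score sentences with a learned retriever or reranker; this only shows
# the select-and-keep mechanism, not the paper's method.
import re
from collections import Counter

def extractive_compress(context: str, query: str, ratio: float = 0.1) -> str:
    """Keep the sentences most relevant to `query` within a character
    budget of ratio * len(context) (ratio=0.1 is roughly 10x compression)."""
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    query_terms = Counter(re.findall(r"\w+", query.lower()))

    def relevance(sentence: str) -> float:
        terms = re.findall(r"\w+", sentence.lower())
        if not terms:
            return 0.0
        # Length-normalized count of query-term occurrences in the sentence.
        return sum(query_terms[t] for t in terms) / len(terms)

    budget = int(len(context) * ratio)
    kept, used = [], 0
    # Greedily take the highest-scoring sentences that still fit the budget.
    order = sorted(range(len(sentences)),
                   key=lambda i: relevance(sentences[i]), reverse=True)
    for i in order:
        if used + len(sentences[i]) <= budget:
            kept.append(i)
            used += len(sentences[i])
    # Re-emit the kept sentences in original order to preserve discourse flow.
    return " ".join(sentences[i] for i in sorted(kept))
```

Because extraction copies sentences verbatim, the compressed prompt stays faithful to the source text, in contrast to abstractive compression (which rewrites the context via summarization) and token pruning (which drops individual tokens).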

