Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression

Abstract

Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks focus only on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring agentic capabilities: workflow generation, tool use/function calling, long-context understanding, and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression affects LLMs' agentic abilities. ACBench spans (1) 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval), (2) quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT), and (3) 15 models, including small (Gemma-2B), standard (Qwen2.5 7B-32B), and distilled reasoning LLMs (DeepSeek-R1-Distill). Our experiments reveal compression tradeoffs: 4-bit quantization preserves workflow generation and tool use (1%-3% drop) but degrades real-world application accuracy by 10%-15%. We introduce ERank, Top-k Ranking Correlation, and Energy to systematize the analysis. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios. The code is available at https://github.com/pprp/ACBench.
