Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression

Abstract

Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks focus only on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring agentic capabilities: workflow generation, tool use/function calling, long-context understanding, and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression affects the agentic abilities of LLMs. ACBench spans (1) 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval), (2) quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT) techniques, and (3) 15 models, including small (Gemma-2B), standard (Qwen2.5 7B-32B), and distilled reasoning LLMs (DeepSeek-R1-Distill). Our experiments reveal a compression tradeoff: 4-bit quantization largely preserves workflow generation and tool use (a 1%-3% drop) but degrades real-world application accuracy by 10%-15%. To systematize the analysis, we introduce three measures: ERank, Top-k Ranking Correlation, and Energy. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios. The code is available at https://github.com/pprp/ACBench.

