Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression
May 26, 2025
Authors: Peijie Dong, Zhenheng Tang, Xiang Liu, Lujun Li, Xiaowen Chu, Bo Li
cs.AI
Abstract
Post-training compression reduces the computational and memory costs of large
language models (LLMs), enabling resource-efficient deployment. However,
existing compression benchmarks focus only on language modeling (e.g.,
perplexity) and natural language understanding tasks (e.g., GLUE accuracy),
ignoring agentic capabilities: workflow generation, tool use/function calling,
long-context understanding, and real-world application. We introduce the Agent
Compression Benchmark (ACBench), the first comprehensive benchmark for
evaluating how compression impacts LLMs' agentic abilities. ACBench spans (1)
12 tasks across 4 capabilities (e.g., WorfBench for workflow generation,
Needle-in-Haystack for long-context retrieval), (2) quantization (GPTQ, AWQ)
and pruning (Wanda, SparseGPT), and (3) 15 models, including small (Gemma-2B),
standard (Qwen2.5 7B-32B), and distilled reasoning LLMs (DeepSeek-R1-Distill).
Our experiments reveal compression tradeoffs: 4-bit quantization preserves
workflow generation and tool use (1%-3% drop) but degrades real-world
application accuracy by 10%-15%. We introduce ERank, Top-k Ranking Correlation
and Energy to systematize analysis. ACBench provides actionable insights for
optimizing LLM compression in agentic scenarios. The code is available at
https://github.com/pprp/ACBench.
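
The abstract names ERank as an analysis metric without defining it. As a rough illustration only, the sketch below uses one common definition of effective rank: the exponential of the Shannon entropy of the normalized singular values of a matrix. Whether ACBench uses exactly this form is an assumption; the function name `effective_rank` and the `eps` parameter are hypothetical and are not taken from the ACBench repository.

```python
import math


def effective_rank(singular_values, eps=1e-12):
    """Effective rank under an assumed definition (not confirmed by the abstract).

    Normalizes the singular values into a probability distribution and
    returns exp(Shannon entropy) of that distribution.
    """
    total = sum(singular_values)
    if total <= eps:
        # An all-zero spectrum carries no information; report rank 0.
        return 0.0
    entropy = 0.0
    for s in singular_values:
        p = s / total
        if p > eps:  # skip zero entries, since p * log(p) tends to 0
            entropy -= p * math.log(p)
    return math.exp(entropy)


if __name__ == "__main__":
    # A flat spectrum uses every direction equally.
    print(effective_rank([1.0, 1.0, 1.0, 1.0]))
    # A spectrum dominated by one value collapses toward rank 1.
    print(effective_rank([5.0, 0.0, 0.0]))
```

Under this definition, a weight or activation matrix whose spectrum flattens or collapses after quantization or pruning would show a changed effective rank, which is one way such a metric could help compare a compressed model with its original.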