Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression

Abstract

Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks focus only on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring agentic capabilities: workflow generation, tool use/function calling, long-context understanding, and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression affects LLMs' agentic abilities. ACBench spans (1) 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval), (2) quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT), and (3) 15 models, including small (Gemma-2B), standard (Qwen2.5 7B-32B), and distilled reasoning LLMs (DeepSeek-R1-Distill). Our experiments reveal compression tradeoffs: 4-bit quantization preserves workflow generation and tool use (1%-3% drop) but degrades real-world application accuracy by 10%-15%. To systematize the analysis, we introduce ERank, Top-k Ranking Correlation, and Energy. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios. The code is available at https://github.com/pprp/ACBench.
