Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

November 11, 2025
作者: Jon Saad-Falcon, Avanika Narayan, Hakki Orhun Akengin, J. Wes Griffin, Herumb Shandilya, Adrian Gamarra Lafuente, Medhya Goel, Rebecca Joseph, Shlok Natarajan, Etash Kumar Guha, Shang Zhu, Ben Athiwaratkun, John Hennessy, Azalia Mirhoseini, Christopher Ré
cs.AI

Abstract

Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure. Rapidly growing demand strains this paradigm, and cloud providers struggle to scale infrastructure at pace. Two advances enable us to rethink this paradigm: small LMs (≤20B active parameters) now achieve performance competitive with frontier models on many tasks, and local accelerators (e.g., Apple M4 Max) run these models at interactive latencies. This raises the question: can local inference viably redistribute demand from centralized infrastructure? Answering it requires measuring whether local LMs can accurately answer real-world queries and whether they can do so efficiently enough to be practical on power-constrained devices (e.g., laptops). We propose intelligence per watt (IPW), task accuracy divided by power consumed, as a metric for assessing the capability and efficiency of local inference across model-accelerator pairs. We conduct a large-scale empirical study across 20+ state-of-the-art local LMs, 8 accelerators, and a representative subset of LLM traffic: 1M real-world single-turn chat and reasoning queries. For each query, we measure accuracy, energy, latency, and power. Our analysis reveals three findings. First, local LMs can accurately answer 88.7% of single-turn chat and reasoning queries, with accuracy varying by domain. Second, from 2023 to 2025, IPW improved 5.3x and local query coverage rose from 23.2% to 71.3%. Third, local accelerators achieve at least 1.4x lower IPW than cloud accelerators running identical models, revealing significant headroom for optimization. These findings demonstrate that local inference can meaningfully redistribute demand from centralized infrastructure, with IPW serving as the critical metric for tracking this transition. We release our IPW profiling harness to enable systematic intelligence-per-watt benchmarking.