Intelligence per Watt: Measuring Intelligence Efficiency of Local AI
November 11, 2025
Authors: Jon Saad-Falcon, Avanika Narayan, Hakki Orhun Akengin, J. Wes Griffin, Herumb Shandilya, Adrian Gamarra Lafuente, Medhya Goel, Rebecca Joseph, Shlok Natarajan, Etash Kumar Guha, Shang Zhu, Ben Athiwaratkun, John Hennessy, Azalia Mirhoseini, Christopher Ré
cs.AI
Abstract
Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure. Rapidly growing demand strains this paradigm, and cloud providers struggle to scale infrastructure at pace. Two advances enable us to rethink this paradigm: small LMs (≤20B active parameters) now achieve performance competitive with frontier models on many tasks, and local accelerators (e.g., Apple M4 Max) run these models at interactive latencies. This raises the question: can local inference viably redistribute demand from centralized infrastructure? Answering this requires measuring whether local LMs can accurately answer real-world queries and whether they can do so efficiently enough to be practical on power-constrained devices (e.g., laptops). We propose intelligence per watt (IPW), task accuracy divided by unit of power, as a metric for assessing the capability and efficiency of local inference across model-accelerator pairs. We conduct a large-scale empirical study across 20+ state-of-the-art local LMs, 8 accelerators, and a representative subset of LLM traffic: 1M real-world single-turn chat and reasoning queries. For each query, we measure accuracy, energy, latency, and power. Our analysis reveals three findings. First, local LMs can accurately answer 88.7% of single-turn chat and reasoning queries, with accuracy varying by domain. Second, from 2023-2025, IPW improved 5.3x and local query coverage rose from 23.2% to 71.3%. Third, local accelerators achieve at least 1.4x lower IPW than cloud accelerators running identical models, revealing significant headroom for optimization. These findings demonstrate that local inference can meaningfully redistribute demand from centralized infrastructure, with IPW serving as the critical metric for tracking this transition. We release our IPW profiling harness for systematic intelligence-per-watt benchmarking.
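As a rough illustration of the metric described in the abstract (not the paper's released profiling harness), IPW can be sketched as task accuracy divided by average power draw during inference. The function and all numeric values below are hypothetical placeholders, not figures from the study:

```python
def intelligence_per_watt(accuracy: float, avg_power_watts: float) -> float:
    """Intelligence per watt (IPW): task accuracy divided by power.

    accuracy        -- fraction of queries answered correctly, in [0, 1]
    avg_power_watts -- mean power draw (W) while serving those queries
    """
    if avg_power_watts <= 0:
        raise ValueError("average power must be positive")
    return accuracy / avg_power_watts


# Hypothetical example: a local model answering 88.7% of queries
# correctly while the accelerator averages 45 W of power draw.
ipw = intelligence_per_watt(0.887, 45.0)
```

Under this reading, IPW rises either when the same accelerator serves a more accurate model or when the same accuracy is achieved at lower power, which is why the metric can track both model and hardware progress over time.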