QEIL v2: Heterogeneous Computing for Edge Intelligence via Roofline-Derived Pareto-Optimal Energy Modeling and Multi-Objective Orchestration

Abstract

Deploying large language models (LLMs) on heterogeneous edge devices demands frameworks that jointly optimize energy efficiency, inference quality, and reliability. Our prior system, QEIL v1 (Kumar & Jha, 2026), achieved a 4.82x improvement in IPW but relied on static efficiency factors, greedy optimization, and unverified candidate selection. QEIL v2 replaces every static heuristic with physics-grounded, runtime-adaptive models. We introduce three device-workload metrics: DASI (roofline-derived compute utilization), CPQ (memory pressure derived from allocation theory), and Phi (thermal yield derived from CMOS leakage physics). Together they form a unified energy equation in which every coefficient is traceable to semiconductor physics. For optimization, PGSAM (Pareto-Guided Simulated Annealing with Momentum) simultaneously minimizes energy, latency, and device underutilization. At inference time, the EAC/ARDE selection cascade with CSVET early stopping progressively verifies repeated samples. Evaluated on WikiText-103, GSM8K, and ARC-Challenge across seven model families (125M to 8B parameters, including one pre-quantized variant), QEIL v2 achieves 75.7% pass@k at 63.8 W (IPW = 0.9749), a 2.86x improvement over standard inference. Applied to a 4-bit Llama-3.1-8B, QEIL v2's physics-grounded routing achieves IPW = 1.024 at 54.8 W, making it the first edge orchestration system to surpass the IPW = 1.0 empirical reference mark; the gain is attributable entirely to QEIL v2's workload-adaptive device allocation on a model with reduced memory bandwidth requirements. Relative to standard inference, total energy drops by 75.6% and latency by 38.3%, with zero thermal throttling and 100% fault recovery across all benchmarks and model families.
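
For readers unfamiliar with the roofline model that DASI is described as being derived from, the following minimal sketch shows the textbook roofline bound only. It is not the paper's DASI definition, which the abstract does not give. The function names and the device figures (10 TFLOP/s peak, 100 GB/s bandwidth) are hypothetical values chosen for illustration.

```python
def attainable_flops(peak_flops: float, mem_bandwidth: float,
                     arithmetic_intensity: float) -> float:
    """Textbook roofline bound in FLOP/s.

    peak_flops: device peak compute throughput (FLOP/s)
    mem_bandwidth: device memory bandwidth (bytes/s)
    arithmetic_intensity: workload FLOPs per byte moved
    """
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def roofline_utilization(peak_flops: float, mem_bandwidth: float,
                         arithmetic_intensity: float) -> float:
    """Fraction of peak compute reachable at this intensity (0 to 1)."""
    return attainable_flops(peak_flops, mem_bandwidth,
                            arithmetic_intensity) / peak_flops


# Hypothetical device, not taken from the paper.
PEAK = 10e12   # 10 TFLOP/s
BW = 100e9     # 100 GB/s

# Ridge point: the intensity at which the workload stops being memory-bound.
ridge = PEAK / BW

# A low-intensity workload (assumed 2 FLOP/byte) is memory-bound.
low_util = roofline_utilization(PEAK, BW, 2.0)

# A high-intensity workload (assumed 200 FLOP/byte) is compute-bound.
high_util = roofline_utilization(PEAK, BW, 200.0)

print(ridge, low_util, high_util)
```

Under these assumed numbers the low-intensity workload can reach only a small fraction of peak compute on this device. That is the kind of device-workload mismatch a roofline-derived utilization metric is meant to expose when assigning work across heterogeneous devices.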
