
Forge-UGC: FX optimization and register-graph engine for universal graph compiler

April 14, 2026
Authors: Satyam Kumar, Saurabh Jha
cs.AI

Abstract

We present Forge-UGC (FX Optimization and Register-Graph Engine for Universal Graph Compilation), a four-phase compiler for deploying transformers on heterogeneous accelerator hardware, validated on the Intel AI Boost NPU. Existing frameworks such as OpenVINO and ONNX Runtime often rely on opaque compilation pipelines, limited pass-level visibility, and weak buffer management, which can lead to high compilation cost and runtime overhead. Forge-UGC addresses these issues with a hardware-agnostic design that separates graph capture, optimization, intermediate-representation lowering, and backend scheduling. Phase 1 captures graphs with torch.export at the ATen operator level, supporting modern transformer components such as rotary position embeddings, grouped-query attention, and SwiGLU without manual decomposition. Phase 2 applies six optimization passes (dead-code elimination, common-subexpression elimination, constant folding, attention fusion, operator fusion, and layout optimization), reducing graph node count by 14.2 to 21.9%. Phase 3 lowers the optimized graph into a typed intermediate representation with explicit virtual-register assignments. Phase 4 performs liveness analysis, linear-scan buffer allocation (reducing peak buffer count by 30 to 48%), and device-affinity scheduling (reducing NPU-CPU transitions by 42 to 65%). Across six model families ranging from 125M to 8B parameters, evaluated on WikiText-103 and GLUE, Forge-UGC delivers 6.9 to 9.2x faster compilation than OpenVINO and ONNX Runtime, 18.2 to 35.7% lower inference latency, and 30.2 to 40.9% lower energy per inference. Fidelity is preserved, with maximum absolute logit differences below 2.1e-5 and KL divergence below 8.4e-9. We also introduce the Fusion Gain Ratio, the Compilation Efficiency Index, and per-pass execution profiling for systematic evaluation of NPU compilation pipelines.
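Two of the Phase 2 passes named in the abstract, dead-code elimination and common-subexpression elimination, are classical graph cleanups. The sketch below illustrates them on a toy operator graph; the graph representation and function names are illustrative assumptions, not Forge-UGC's actual data structures.

```python
# Illustrative DCE and CSE over a toy operator graph.
# A node is (result_name, op, operand_names); the representation is
# hypothetical, not Forge-UGC's IR.

def dead_code_eliminate(nodes, outputs):
    """Drop nodes whose results are never used, walking backward
    from the required outputs."""
    live = set(outputs)
    kept = []
    for name, op, args in reversed(nodes):
        if name in live:
            kept.append((name, op, args))
            live.update(args)       # operands of a live node become live
    return list(reversed(kept))

def common_subexpr_eliminate(nodes):
    """Merge nodes computing the same (op, operands) expression,
    rewriting later uses to the first occurrence. Assumes pure ops."""
    seen = {}     # (op, operands) -> canonical result name
    rename = {}   # eliminated name -> canonical name
    kept = []
    for name, op, args in nodes:
        args = tuple(rename.get(a, a) for a in args)
        key = (op, args)
        if key in seen:
            rename[name] = seen[key]    # duplicate: alias, don't emit
        else:
            seen[key] = name
            kept.append((name, op, args))
    return kept

# Example: t2 duplicates t1, and t4 is dead once only t3 is required.
nodes = [
    ("t1", "add", ("x", "y")),
    ("t2", "add", ("x", "y")),      # same expression as t1
    ("t3", "mul", ("t2", "z")),
    ("t4", "neg", ("t1",)),         # unused by the output
]
optimized = dead_code_eliminate(common_subexpr_eliminate(nodes), {"t3"})
```

Running CSE first lets DCE see that `t3` now reads `t1`, so both `t2` and `t4` disappear, shrinking the four-node graph to two, the same kind of node-count reduction the abstract reports for Phase 2.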
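Phase 4's buffer allocator applies linear scan, originally a register-allocation technique, to tensor buffers: buffers whose liveness intervals have expired are returned to a free pool and reused. A minimal sketch under assumed interval inputs (the function and data layout are hypothetical):

```python
# Minimal linear-scan buffer allocation over liveness intervals.
# Intervals would come from a liveness analysis; here they are given.

def linear_scan_allocate(intervals):
    """intervals: list of (name, start, end) liveness ranges.
    Returns (assignment, peak): a name -> buffer-id mapping and the
    number of distinct buffers created, which equals the peak number
    of simultaneously live buffers under greedy reuse."""
    intervals = sorted(intervals, key=lambda iv: iv[1])  # by start point
    free = []        # buffer ids available for reuse
    active = []      # (end, buffer_id) pairs still live
    next_id = 0
    assignment = {}
    for name, start, end in intervals:
        # Expire intervals that ended before this one starts.
        still_active = []
        for e, bid in active:
            (free if e < start else still_active).append(
                bid if e < start else (e, bid))
        active = still_active
        if free:
            bid = free.pop()        # reuse an expired buffer
        else:
            bid = next_id           # no reuse possible: new buffer
            next_id += 1
        assignment[name] = bid
        active.append((end, bid))
    return assignment, next_id

# Four tensors, at most two live at once: two buffers suffice.
alloc, peak = linear_scan_allocate(
    [("a", 0, 2), ("b", 1, 3), ("c", 3, 5), ("d", 4, 6)])
```

In this toy run, a naive one-buffer-per-tensor scheme would allocate four buffers, while linear scan needs only two (`a` and `c` share one, `b` and `d` the other), a 50% reduction in peak buffer count, in the spirit of the 30 to 48% reductions the abstract reports.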
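The fidelity claims (maximum absolute logit difference below 2.1e-5, KL divergence below 8.4e-9) can be checked with two standard metrics. A self-contained sketch, with KL taken between the softmax distributions of reference and compiled logits; the exact normalization the paper uses is not specified here, so this is one plausible reading:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fidelity_metrics(ref_logits, test_logits):
    """Return (max absolute logit difference, KL(ref || test)) between
    a reference model's logits and a compiled model's logits.
    The pairing of these two metrics mirrors the abstract's fidelity
    bounds; the KL normalization here is an assumption."""
    max_abs = max(abs(r - t) for r, t in zip(ref_logits, test_logits))
    p, q = softmax(ref_logits), softmax(test_logits)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return max_abs, kl

# A perturbation of 1e-5 on one logit stays within abstract-scale bounds.
max_abs, kl = fidelity_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 3.0 + 1e-5])
```

Because KL is roughly quadratic in small logit perturbations, a 1e-5 logit error yields a KL on the order of 1e-11, consistent with the abstract's pairing of a ~1e-5 logit bound with a ~1e-9 KL bound.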