Forge-UGC: FX Optimization and Register-Graph Engine for Universal Graph Compilation

Abstract

We present Forge-UGC (FX Optimization and Register-Graph Engine for Universal Graph Compilation), a four-phase compiler for deploying transformers on heterogeneous accelerator hardware, validated on the Intel AI Boost NPU. Existing frameworks such as OpenVINO and ONNX Runtime often rely on opaque compilation pipelines, offer limited pass-level visibility, and manage buffers poorly, which can lead to higher compilation cost and runtime overhead. Forge-UGC addresses this with a hardware-agnostic design that separates graph capture, optimization, intermediate representation lowering, and backend scheduling. Phase 1 captures graphs with torch.export at the ATen operator level, supporting modern transformer components such as rotary position embeddings, grouped-query attention, and SwiGLU without manual decomposition. Phase 2 applies six optimization passes (dead code elimination, common subexpression elimination, constant folding, attention fusion, operator fusion, and layout optimization), which together reduce graph node count by 14.2 to 21.9%. Phase 3 lowers the optimized graph into a typed intermediate representation with explicit virtual register assignments. Phase 4 performs liveness analysis, linear-scan buffer allocation (reducing peak buffer count by 30 to 48%), and device-affinity scheduling (reducing NPU-CPU transitions by 42 to 65%). Across six model families ranging from 125M to 8B parameters, evaluated on WikiText-103 and GLUE, Forge-UGC delivers 6.9 to 9.2x faster compilation than OpenVINO and ONNX Runtime, 18.2 to 35.7% lower inference latency, and 30.2 to 40.9% lower energy per inference. Fidelity is preserved, with maximum absolute logit differences below 2.1e-5 and KL divergence below 8.4e-9. We also introduce the Fusion Gain Ratio, the Compilation Efficiency Index, and per-pass execution profiling for systematic evaluation of NPU compilation pipelines.
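To make the Phase 4 idea concrete, the following is a minimal sketch of liveness analysis followed by linear-scan buffer allocation over virtual registers. It is not Forge-UGC's implementation: the `Op` type and the `live_ranges` and `linear_scan` helpers are hypothetical names introduced here, and the sketch assumes a straight-line, topologically ordered program in which every register is written exactly once and all buffers are interchangeable (so only the buffer count is minimized, not bytes).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One instruction in a hypothetical register-style IR (illustrative only)."""
    dst: str            # virtual register written by this op
    srcs: tuple = ()    # virtual registers read by this op


def live_ranges(ops, outputs):
    """Liveness for straight-line code: register -> (def index, last-use index)."""
    ranges = {}
    for i, op in enumerate(ops):
        for s in op.srcs:
            start, _ = ranges[s]
            ranges[s] = (start, i)      # extend the range to this use
        ranges[op.dst] = (i, i)         # assumes each register is defined once
    for r in outputs:
        # Graph outputs must stay alive past the last op.
        ranges[r] = (ranges[r][0], len(ops))
    return ranges


def linear_scan(ranges):
    """Walk registers by definition order, reusing buffers of dead registers."""
    assignment = {}
    free = []       # buffer ids available for reuse
    active = []     # (last_use, buffer_id) for registers that are still live
    next_id = 0
    for reg, (start, end) in sorted(ranges.items(), key=lambda kv: kv[1][0]):
        still_active = []
        for last_use, buf in active:
            if last_use < start:        # died strictly before this definition
                free.append(buf)
            else:
                still_active.append((last_use, buf))
        active = still_active
        if free:
            buf = free.pop()
        else:
            buf = next_id
            next_id += 1
        assignment[reg] = buf
        active.append((end, buf))
    return assignment, next_id


# Toy residual block: v4 = act(matmul(norm(v0))) + v0
ops = [
    Op("v0"),                   # graph input
    Op("v1", ("v0",)),          # norm
    Op("v2", ("v1",)),          # matmul
    Op("v3", ("v2",)),          # activation
    Op("v4", ("v3", "v0")),     # residual add
]
ranges = live_ranges(ops, outputs=["v4"])
assignment, n_buffers = linear_scan(ranges)
print(assignment, n_buffers)
```

In this toy program the five virtual registers fit in three buffers, because `v1` and `v2` are dead by the time `v3` and `v4` are defined. The release test is deliberately conservative: an op never writes into a buffer that it also reads, so in-place reuse is not modeled.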
