Forge-UGC: FX optimization and register-graph engine for universal graph compiler

Abstract

We present Forge-UGC (FX Optimization and Register-Graph Engine for Universal Graph Compilation), a four-phase compiler for deploying transformers on heterogeneous accelerator hardware, validated on the Intel AI Boost NPU. Existing frameworks such as OpenVINO and ONNX Runtime often have opaque compilation pipelines, offer limited pass-level visibility, and manage buffers weakly, which can lead to higher compilation cost and runtime overhead. Forge-UGC addresses these problems with a hardware-agnostic design that separates graph capture, optimization, intermediate representation lowering, and backend scheduling. Phase 1 captures graphs with torch.export at the ATen operator level, supporting modern transformer components such as rotary position embeddings (RoPE), grouped-query attention (GQA), and SwiGLU without manual decomposition. Phase 2 applies six optimization passes (dead code elimination, common subexpression elimination, constant folding, attention fusion, operator fusion, and layout optimization), reducing graph node count by 14.2% to 21.9%. Phase 3 lowers the optimized graph into a typed intermediate representation with explicit virtual register assignments. Phase 4 performs liveness analysis, linear-scan buffer allocation (reducing peak buffer count by 30% to 48%), and device-affinity scheduling (reducing NPU-CPU transitions by 42% to 65%). Across six model families ranging from 125M to 8B parameters, evaluated on WikiText-103 and GLUE, Forge-UGC compiles 6.9x to 9.2x faster than OpenVINO and ONNX Runtime, with 18.2% to 35.7% lower inference latency and 30.2% to 40.9% lower energy per inference. Fidelity is preserved: the maximum absolute logit difference stays below 2.1e-5 and the KL divergence below 8.4e-9. We also introduce Fusion Gain Ratio, Compilation Efficiency Index, and per-pass execution profiling for the systematic evaluation of NPU compilation pipelines.
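To make the Phase 4 terminology concrete, the following is a minimal sketch of liveness analysis followed by linear-scan buffer allocation over a straight-line list of operations. It is not Forge-UGC's implementation: the `Op` record, the function names, and the toy four-op graph are hypothetical, and the sketch assumes every buffer has the same size and that an op's output may not alias any of its inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    out: str     # virtual register written by this op (hypothetical IR record)
    ins: tuple   # virtual registers read by this op


def live_intervals(ops, outputs):
    """Liveness: map each defined register to (definition index, last-use index)."""
    intervals = {}
    for i, op in enumerate(ops):
        intervals[op.out] = [i, i]
        for reg in op.ins:
            if reg in intervals:          # graph inputs are not allocated here
                intervals[reg][1] = i
    for reg in outputs:
        intervals[reg][1] = len(ops)      # graph outputs stay live to the end
    return {reg: tuple(span) for reg, span in intervals.items()}


def linear_scan(intervals):
    """Assign each register a reusable buffer id; returns (assignment, buffer count)."""
    assignment, active, free = {}, [], []
    n_buffers = 0
    for reg, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Release buffers whose register had its last use strictly before `start`.
        still_active = []
        for last_use, buf in active:
            if last_use < start:
                free.append(buf)
            else:
                still_active.append((last_use, buf))
        active = still_active
        if free:
            buf = free.pop()              # reuse a dead buffer
        else:
            buf = n_buffers               # otherwise allocate a new one
            n_buffers += 1
        assignment[reg] = buf
        active.append((end, buf))
    return assignment, n_buffers


# Toy chain: x -> t0 -> t1 -> t2 -> t3, with t3 as the graph output.
ops = [Op("t0", ("x",)), Op("t1", ("t0",)), Op("t2", ("t1",)), Op("t3", ("t2",))]
assignment, n_buffers = linear_scan(live_intervals(ops, outputs=["t3"]))
print(assignment, n_buffers)
```

In this toy chain the four virtual registers share two buffers, because each intermediate dies as soon as its single consumer has run. A real allocator would also need to account for buffer sizes, alignment, and device placement, none of which are modeled here.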
