Forge-UGC: FX Optimization and Register-Graph Engine for Universal Graph Compiler

Abstract

We present Forge-UGC (FX Optimization and Register-Graph Engine for Universal Graph Compilation), a four-phase compiler for deploying transformers on heterogeneous accelerator hardware, validated on the Intel AI Boost NPU. Existing frameworks such as OpenVINO and ONNX Runtime often rely on opaque compilation pipelines, expose limited pass-level visibility, and manage buffers weakly, which can lead to higher compilation cost and runtime overhead. Forge-UGC addresses these issues with a hardware-agnostic design that separates graph capture, optimization, intermediate representation lowering, and backend scheduling. Phase 1 captures graphs with torch.export at the ATen operator level, supporting modern transformer components such as rotary position embeddings, grouped-query attention, and SwiGLU without manual decomposition. Phase 2 applies six optimization passes (dead code elimination, common subexpression elimination, constant folding, attention fusion, operator fusion, and layout optimization), reducing graph node count by 14.2 to 21.9%. Phase 3 lowers the optimized graph into a typed intermediate representation with explicit virtual register assignments. Phase 4 performs liveness analysis and linear-scan buffer allocation, which reduce peak buffer count by 30 to 48%, and device-affinity scheduling, which reduces NPU-CPU transitions by 42 to 65%. Across six model families ranging from 125M to 8B parameters, evaluated on WikiText-103 and GLUE, Forge-UGC compiles 6.9 to 9.2x faster than OpenVINO and ONNX Runtime, with 18.2 to 35.7% lower inference latency and 30.2 to 40.9% lower energy per inference. Fidelity is preserved: maximum absolute logit differences stay below 2.1e-5 and KL divergence stays below 8.4e-9. We also introduce the Fusion Gain Ratio, the Compilation Efficiency Index, and per-pass execution profiling for systematic evaluation of NPU compilation pipelines.
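
To make the Phase 4 idea concrete, the following is a minimal sketch of liveness analysis followed by linear-scan buffer allocation over a toy register IR. It is not the Forge-UGC implementation: the `Instr` type, the `live_intervals` and `linear_scan` functions, and the six-instruction program are all hypothetical names and data chosen for illustration, and the sketch assumes each virtual register is defined exactly once and that all buffers are interchangeable in size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """One instruction in a toy register IR: dst = op(*srcs). Hypothetical type."""
    dst: str
    op: str
    srcs: tuple = ()


def live_intervals(program, outputs):
    """Map each defined virtual register to (definition index, last-use index)."""
    start, end = {}, {}
    for i, ins in enumerate(program):
        start[ins.dst] = i
        end[ins.dst] = i
        for src in ins.srcs:
            end[src] = i  # later uses extend the interval
    for reg in outputs:
        end[reg] = len(program)  # graph outputs stay live past the last instruction
    # Graph inputs such as "x" have no definition here and are not allocated.
    return {reg: (start[reg], end[reg]) for reg in start}


def linear_scan(intervals):
    """Assign a buffer id to each register, reusing buffers of expired intervals."""
    assignment = {}
    active = []   # (last-use index, buffer id) for registers still live
    free = []     # buffer ids available for reuse
    next_id = 0
    for reg, (s, e) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        still_live = []
        for last_use, buf in active:
            # Strict comparison: a register read by the defining instruction
            # itself keeps its buffer, so no in-place aliasing is assumed.
            if last_use < s:
                free.append(buf)
            else:
                still_live.append((last_use, buf))
        active = still_live
        if free:
            buf = free.pop()
        else:
            buf = next_id
            next_id += 1
        assignment[reg] = buf
        active.append((e, buf))
    return assignment, next_id


# Toy attention-like chain; "x" is a graph input.
program = [
    Instr("r0", "linear", ("x",)),
    Instr("r1", "linear", ("x",)),
    Instr("r2", "matmul", ("r0", "r1")),
    Instr("r3", "softmax", ("r2",)),
    Instr("r4", "linear", ("x",)),
    Instr("r5", "matmul", ("r3", "r4")),
]
intervals = live_intervals(program, outputs=["r5"])
assignment, num_buffers = linear_scan(intervals)
print(f"{num_buffers} buffers for {len(intervals)} registers: {assignment}")
```

On this toy program the sketch needs three buffers for six virtual registers, because the buffers holding `r0`, `r1`, and `r2` are recycled once their last use has passed. A production allocator would also have to account for tensor sizes, alignment, and device placement, none of which are modeled here.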
