Forge-UGC: FX optimization and register-graph engine for universal graph compiler

Hugging Face Daily Papers 04/14/26, 12:00 AM Papers

Summary

Forge-UGC is a four-phase universal graph compiler that speeds up transformer deployment on NPUs, compiling 6.9-9.2× faster and cutting inference latency by 18-36% and energy per inference by 30-41% versus OpenVINO and ONNX Runtime.

We present Forge-UGC (FX Optimization and Register-Graph Engine for Universal Graph Compilation), a four-phase compiler for transformer deployment on heterogeneous accelerator hardware, validated on Intel AI Boost NPU. Existing frameworks such as OpenVINO and ONNX Runtime often use opaque compilation pipelines, limited pass-level visibility, and weak buffer management, which can lead to higher compilation cost and runtime overhead. Forge-UGC addresses this with a hardware-agnostic design that separates graph capture, optimization, intermediate representation lowering, and backend scheduling. Phase 1 captures graphs with torch.export at the ATen operator level, supporting modern transformer components such as rotary position embeddings, grouped-query attention, and SwiGLU without manual decomposition. Phase 2 applies six optimization passes: dead code elimination, common subexpression elimination, constant folding, attention fusion, operator fusion, and layout optimization, reducing graph node count by 14.2 to 21.9%. Phase 3 lowers the optimized graph into a typed intermediate representation with explicit virtual register assignments. Phase 4 performs liveness analysis, linear-scan buffer allocation, reducing peak buffer count by 30 to 48%, and device-affinity scheduling, reducing NPU-CPU transitions by 42 to 65%. Across six model families ranging from 125M to 8B parameters, evaluated on WikiText-103 and GLUE, Forge-UGC delivers 6.9 to 9.2x faster compilation than OpenVINO and ONNX Runtime, 18.2 to 35.7% lower inference latency, and 30.2 to 40.9% lower energy per inference. Fidelity is preserved, with max absolute logit differences below 2.1e-5 and KL divergence below 8.4e-9. We also introduce Fusion Gain Ratio, Compilation Efficiency Index, and per-pass execution profiling for systematic evaluation of NPU compilation pipelines.
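Two of the six Phase 2 passes, common subexpression elimination and dead code elimination, can be illustrated on a toy graph. The following is a minimal sketch, not the Forge-UGC implementation: the `Node` class, the function names, and the example graph are hypothetical, and every operator is assumed to be pure (free of side effects) so that duplicates can be merged and unused results dropped safely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical graph node; not a Forge-UGC or torch.fx type."""
    name: str    # unique result name
    op: str      # operator label, e.g. "input", "mul", "add"
    args: tuple  # names of the nodes that produce the operands


def common_subexpression_elimination(nodes, outputs):
    """Merge nodes that apply the same op to the same (already merged) inputs.

    `nodes` is assumed to be in topological order.
    """
    canonical = {}  # (op, args) -> name of the first node computing it
    renamed = {}    # eliminated name -> surviving name
    kept = []
    for node in nodes:
        args = tuple(renamed.get(a, a) for a in node.args)
        key = (node.op, args)
        if node.op != "input" and key in canonical:
            renamed[node.name] = canonical[key]  # duplicate: reuse earlier result
            continue
        canonical.setdefault(key, node.name)
        kept.append(Node(node.name, node.op, args))
    return kept, [renamed.get(o, o) for o in outputs]


def dead_code_elimination(nodes, outputs):
    """Keep only nodes that the graph outputs depend on."""
    live = set(outputs)
    kept = []
    for node in reversed(nodes):  # walk consumers before producers
        if node.name in live:
            live.update(node.args)
            kept.append(node)
    kept.reverse()
    return kept


graph = [
    Node("x", "input", ()),
    Node("w", "input", ()),
    Node("a", "mul", ("x", "w")),
    Node("b", "mul", ("x", "w")),  # same computation as "a"
    Node("c", "add", ("a", "b")),
    Node("d", "relu", ("x",)),     # result is never used
]

merged, outputs = common_subexpression_elimination(graph, ["c"])
optimized = dead_code_elimination(merged, outputs)
print([n.name for n in optimized])
```

In this toy graph the duplicate multiply and the unused `relu` are removed, so six nodes become four. The node-count reductions reported in the abstract come from the full six-pass pipeline on real models and are not reproduced by this sketch.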

Original Article

Cached at: 04/21/26, 03:38 PM

Source: https://huggingface.co/papers/2604.16498

Abstract

Forge-UGC is a four-phase compiler for efficient transformer deployment on heterogeneous hardware, offering faster compilation, reduced inference latency, and lower energy consumption compared to existing frameworks.

We present Forge-UGC (FX Optimization and Register-Graph Engine for Universal Graph Compilation), a four-phase compiler for transformer deployment on heterogeneous accelerator hardware, validated on Intel AI Boost NPU. Existing frameworks such as OpenVINO and ONNX Runtime often use opaque compilation pipelines, limited pass-level visibility, and weak buffer management, which can lead to higher compilation cost and runtime overhead. Forge-UGC addresses this with a hardware-agnostic design that separates graph capture, optimization, intermediate representation lowering, and backend scheduling. Phase 1 captures graphs with torch.export at the ATen operator level, supporting modern transformer components such as rotary position embeddings, grouped-query attention, and SwiGLU without manual decomposition. Phase 2 applies six optimization passes: dead code elimination, common subexpression elimination, constant folding, attention fusion, operator fusion, and layout optimization, reducing graph node count by 14.2 to 21.9%. Phase 3 lowers the optimized graph into a typed intermediate representation with explicit virtual register assignments. Phase 4 performs liveness analysis, linear-scan buffer allocation, reducing peak buffer count by 30 to 48%, and device-affinity scheduling, reducing NPU-CPU transitions by 42 to 65%. Across six model families ranging from 125M to 8B parameters, evaluated on WikiText-103 and GLUE, Forge-UGC delivers 6.9 to 9.2x faster compilation than OpenVINO and ONNX Runtime, 18.2 to 35.7% lower inference latency, and 30.2 to 40.9% lower energy per inference. Fidelity is preserved, with max absolute logit differences below 2.1e-5 and KL divergence below 8.4e-9. We also introduce Fusion Gain Ratio, Compilation Efficiency Index, and per-pass execution profiling for systematic evaluation of NPU compilation pipelines.
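The Phase 4 combination of liveness analysis and linear-scan buffer allocation can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's allocator: the program encoding, function names, and example are hypothetical; each virtual register is assumed to be written exactly once; all buffers are treated as interchangeable (sizes and dtypes are ignored); and, unlike classic linear-scan register allocation, there is no fixed register budget and therefore no spilling.

```python
import heapq


def live_intervals(program):
    """Compute (definition index, last-use index) for each virtual register.

    `program` is a list of (dest, sources) tuples in execution order.
    """
    intervals = {}
    for i, (dest, sources) in enumerate(program):
        for src in sources:
            start, _ = intervals[src]
            intervals[src] = (start, i)  # extend lifetime to this use
        intervals[dest] = (i, i)
    return intervals


def linear_scan(intervals):
    """Map virtual registers to buffer ids, reusing buffers of dead registers."""
    assignment = {}
    active = []   # min-heap of (last_use_index, buffer_id) for live registers
    free = []     # min-heap of buffer ids available for reuse
    next_id = 0
    for reg, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Release buffers whose register died strictly before this definition,
        # so an instruction never writes into one of its own operands.
        while active and active[0][0] < start:
            _, buf = heapq.heappop(active)
            heapq.heappush(free, buf)
        if free:
            buf = heapq.heappop(free)
        else:
            buf = next_id
            next_id += 1
        assignment[reg] = buf
        heapq.heappush(active, (end, buf))
    return assignment, next_id


# Hypothetical five-instruction program over virtual registers r0..r4.
program = [
    ("r0", ()),
    ("r1", ("r0",)),
    ("r2", ("r1",)),
    ("r3", ("r1", "r2")),
    ("r4", ("r3",)),
]

intervals = live_intervals(program)
assignment, num_buffers = linear_scan(intervals)
print(assignment, num_buffers)
```

Here five virtual registers fit into three buffers, because `r2` and `r4` reuse the buffer that `r0` no longer needs. The figure is a property of this toy program only; a production allocator would also have to match buffer sizes and respect device placement.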

Similar Articles

Benchmark: ONNX Runtime vs HF Transformers vs GGUF for Parakeet TDT 0.6B on CPU-only hardware [D]

KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators

Optimizing Models to Be Fast at Codegen (8 minute read)

GamerForge

A hackable compiler to generate efficient fused GPU kernels for AI models [P]
