Forge-UGC: FX optimization and register-graph engine for universal graph compiler

Hugging Face Daily Papers Papers

Summary

Forge-UGC is a four-phase universal graph compiler that speeds up transformer deployment on NPUs, cutting compilation time 6.9-9.2×, inference latency 18-36 % and energy 30-41 % versus OpenVINO/ONNX Runtime.

We present Forge-UGC (FX Optimization and Register-Graph Engine for Universal Graph Compilation), a four-phase compiler for transformer deployment on heterogeneous accelerator hardware, validated on Intel AI Boost NPU. Existing frameworks such as OpenVINO and ONNX Runtime often use opaque compilation pipelines, limited pass-level visibility, and weak buffer management, which can lead to higher compilation cost and runtime overhead. Forge-UGC addresses this with a hardware-agnostic design that separates graph capture, optimization, intermediate representation lowering, and backend scheduling. Phase 1 captures graphs with torch.export at the ATen operator level, supporting modern transformer components such as rotary position embeddings, grouped-query attention, and SwiGLU without manual decomposition. Phase 2 applies six optimization passes: dead code elimination, common subexpression elimination, constant folding, attention fusion, operator fusion, and layout optimization, reducing graph node count by 14.2 to 21.9%. Phase 3 lowers the optimized graph into a typed intermediate representation with explicit virtual register assignments. Phase 4 performs liveness analysis, linear-scan buffer allocation, reducing peak buffer count by 30 to 48%, and device-affinity scheduling, reducing NPU-CPU transitions by 42 to 65%. Across six model families ranging from 125M to 8B parameters, evaluated on WikiText-103 and GLUE, Forge-UGC delivers 6.9 to 9.2x faster compilation than OpenVINO and ONNX Runtime, 18.2 to 35.7% lower inference latency, and 30.2 to 40.9% lower energy per inference. Fidelity is preserved, with max absolute logit differences below 2.1e-5 and KL divergence below 8.4e-9. We also introduce Fusion Gain Ratio, Compilation Efficiency Index, and per-pass execution profiling for systematic evaluation of NPU compilation pipelines.
Original Article
View Cached Full Text

Cached at: 04/21/26, 03:38 PM

Paper page - Forge-UGC: FX optimization and register-graph engine for universal graph compiler

Source: https://huggingface.co/papers/2604.16498

Abstract

Forge-UGC is a four-phase compiler for efficient transformer deployment on heterogeneous hardware, offering faster compilation, reduced inference latency, and lower energy consumption compared to existing frameworks.

We present Forge-UGC (FX Optimization and Register-Graph Engine for Universal Graph Compilation), a four-phase compiler fortransformer deploymentonheterogeneous accelerator hardware, validated on Intel AI Boost NPU. Existing frameworks such as OpenVINO and ONNX Runtime often use opaque compilation pipelines, limited pass-level visibility, and weak buffer management, which can lead to higher compilation cost and runtime overhead. Forge-UGC addresses this with a hardware-agnostic design that separates graph capture, optimization, intermediate representation lowering, and backend scheduling. Phase 1 captures graphs withtorch.exportat theATen operator level, supporting modern transformer components such asrotary position embeddings,grouped-query attention, andSwiGLUwithout manual decomposition. Phase 2 applies six optimization passes:dead code elimination,common subexpression elimination,constant folding,attention fusion,operator fusion, andlayout optimization, reducing graph node count by 14.2 to 21.9%. Phase 3 lowers the optimized graph into atyped intermediate representationwith explicitvirtual register assignments. Phase 4 performsliveness analysis,linear-scan buffer allocation, reducing peak buffer count by 30 to 48%, anddevice-affinity scheduling, reducingNPU-CPU transitionsby 42 to 65%. Across six model families ranging from 125M to 8B parameters, evaluated on WikiText-103 and GLUE, Forge-UGC delivers 6.9 to 9.2x faster compilation than OpenVINO and ONNX Runtime, 18.2 to 35.7% lower inference latency, and 30.2 to 40.9% lower energy per inference. Fidelity is preserved, with max absolute logit differences below 2.1e-5 and KL divergence below 8.4e-9. We also introduceFusion Gain Ratio,Compilation Efficiency Index, andper-pass execution profilingfor systematic evaluation of NPU compilation pipelines.

View arXiv pageView PDFAdd to collection

Get this paper in your agent:

hf papers read 2604\.16498

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2604.16498 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2604.16498 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2604.16498 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

Optimizing Models to Be Fast at Codegen (8 minute read)

TLDR AI

Morph LLC describes three key techniques—training a speculator on coding output, auto-searching kernels on cheap GPUs, and writing a custom interconnect—to dramatically speed up open models like Qwen and DeepSeek for coding agent workloads, achieving up to 3x speculative decoding speedup and 97-162 tok/s on a $7K GPU.

GamerForge

Product Hunt

GamerForge is an AI-powered tool that transforms game, CGI, and VFX assets, allowing creators to enhance and edit digital assets efficiently.

A hackable compiler to generate efficient fused GPU kernels for AI models [P]

Reddit r/MachineLearning

The author presents a custom, hackable ML compiler written in Python that lowers LLMs to optimized CUDA kernels through a multi-stage IR pipeline, achieving performance competitive with or superior to PyTorch on specific operations. The article details the compiler's optimization passes, lowering rules, and CLI usage for generating efficient fused GPU kernels.