Forge-UGC: FX optimization and register-graph engine for universal graph compiler
Summary
Forge-UGC is a four-phase universal graph compiler for deploying transformers on NPUs. Compared with OpenVINO and ONNX Runtime, it compiles 6.9-9.2× faster and reduces inference latency by 18-36% and energy per inference by 30-41%.
Cached at: 04/21/26, 03:38 PM
Source: https://huggingface.co/papers/2604.16498
Abstract
Forge-UGC is a four-phase compiler for efficient transformer deployment on heterogeneous hardware, offering faster compilation, reduced inference latency, and lower energy consumption compared to existing frameworks.
We present Forge-UGC (FX Optimization and Register-Graph Engine for Universal Graph Compilation), a four-phase compiler for transformer deployment on heterogeneous accelerator hardware, validated on Intel AI Boost NPU. Existing frameworks such as OpenVINO and ONNX Runtime often use opaque compilation pipelines, limited pass-level visibility, and weak buffer management, which can lead to higher compilation cost and runtime overhead. Forge-UGC addresses this with a hardware-agnostic design that separates graph capture, optimization, intermediate representation lowering, and backend scheduling. Phase 1 captures graphs with torch.export at the ATen operator level, supporting modern transformer components such as rotary position embeddings, grouped-query attention, and SwiGLU without manual decomposition. Phase 2 applies six optimization passes: dead code elimination, common subexpression elimination, constant folding, attention fusion, operator fusion, and layout optimization, reducing graph node count by 14.2 to 21.9%. Phase 3 lowers the optimized graph into a typed intermediate representation with explicit virtual register assignments. Phase 4 performs liveness analysis, linear-scan buffer allocation, reducing peak buffer count by 30 to 48%, and device-affinity scheduling, reducing NPU-CPU transitions by 42 to 65%. Across six model families ranging from 125M to 8B parameters, evaluated on WikiText-103 and GLUE, Forge-UGC delivers 6.9 to 9.2x faster compilation than OpenVINO and ONNX Runtime, 18.2 to 35.7% lower inference latency, and 30.2 to 40.9% lower energy per inference. Fidelity is preserved, with max absolute logit differences below 2.1e-5 and KL divergence below 8.4e-9. We also introduce Fusion Gain Ratio, Compilation Efficiency Index, and per-pass execution profiling for systematic evaluation of NPU compilation pipelines.
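The abstract does not show how Phase 4 works internally, so the following is only a minimal sketch of the general technique it names: liveness analysis over virtual registers followed by linear-scan buffer allocation. The `Instr` record, the toy SwiGLU-style program, and the function names are all hypothetical and are not taken from Forge-UGC. The sketch assumes a straight-line program and conservatively never reuses a buffer in the same instruction that last reads it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """One instruction in a toy register-graph IR (hypothetical format)."""
    dst: str           # virtual register written by this instruction
    op: str            # operator name, kept only for readability
    srcs: tuple = ()   # virtual registers read by this instruction


def live_intervals(program, outputs):
    """Return {register: (def_index, last_use_index)} for a straight-line program."""
    intervals = {}
    for i, ins in enumerate(program):
        for src in ins.srcs:
            start, _ = intervals[src]
            intervals[src] = (start, i)      # extend the interval to this use
        intervals[ins.dst] = (i, i)          # a new value is defined here
    for reg in outputs:
        start, _ = intervals[reg]
        intervals[reg] = (start, len(program))  # outputs outlive the program
    return intervals


def linear_scan(intervals):
    """Map each register to a buffer id, reusing buffers whose value is dead."""
    assignment = {}
    active = []    # (last_use, buffer_id) pairs for values that are still live
    free = []      # buffer ids whose previous value is dead
    next_id = 0
    for reg, (start, last) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        still_live = []
        for end, buf in active:
            if end < start:
                free.append(buf)             # value expired before this definition
            else:
                still_live.append((end, buf))
        active = still_live
        if free:
            buf = free.pop()
        else:
            buf, next_id = next_id, next_id + 1
        assignment[reg] = buf
        active.append((last, buf))
    return assignment, next_id


# Toy SwiGLU-style feed-forward block with a residual connection (illustrative only).
program = [
    Instr("%0", "input"),
    Instr("%1", "rmsnorm", ("%0",)),
    Instr("%2", "matmul_gate", ("%1",)),
    Instr("%3", "matmul_up", ("%1",)),
    Instr("%4", "silu", ("%2",)),
    Instr("%5", "mul", ("%4", "%3")),
    Instr("%6", "matmul_down", ("%5",)),
    Instr("%7", "add", ("%0", "%6")),
]

intervals = live_intervals(program, outputs=["%7"])
assignment, num_buffers = linear_scan(intervals)
print(f"{len(intervals)} virtual registers -> {num_buffers} buffers")
```

On this toy program the eight virtual registers fit in four buffers. That figure only illustrates the mechanism and says nothing about the 30 to 48% reduction reported in the abstract, which comes from the authors' own allocator on real models.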
Similar Articles
Benchmark: ONNX Runtime vs HF Transformers vs GGUF for Parakeet TDT 0.6B on CPU-only hardware [D]
A benchmark comparing ONNX Runtime, HF Transformers, and GGUF for the Parakeet TDT 0.6B ASR model on CPU-only hardware shows ONNX Runtime achieves 37% faster inference than HF Transformers bfloat16, while GGUF prioritizes memory efficiency.
KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators
KForge is a cross-platform framework that uses two collaborating LLM-based agents to automatically generate and optimize high-performance compute kernels for diverse AI accelerators, achieving significant speedups on NVIDIA B200 and Intel Arc B580 hardware.
Optimizing Models to Be Fast at Codegen (8 minute read)
Morph LLC describes three key techniques—training a speculator on coding output, auto-searching kernels on cheap GPUs, and writing a custom interconnect—to dramatically speed up open models like Qwen and DeepSeek for coding agent workloads, achieving up to 3x speculative decoding speedup and 97-162 tok/s on a $7K GPU.
GamerForge
GamerForge is an AI-powered tool that transforms game, CGI, and VFX assets, allowing creators to enhance and edit digital assets efficiently.
A hackable compiler to generate efficient fused GPU kernels for AI models [P]
The author presents a custom, hackable ML compiler written in Python that lowers LLMs to optimized CUDA kernels through a multi-stage IR pipeline, achieving performance competitive with or superior to PyTorch on specific operations. The article details the compiler's optimization passes, lowering rules, and CLI usage for generating efficient fused GPU kernels.