PassNet: Scaling Large Language Models for Graph Compiler Pass Generation
Summary
This paper introduces PassNet, a large-scale ecosystem for LLM-based compiler pass generation, including a dataset of over 18K computational graphs and a benchmark (PassBench) with a new metric. Experiments reveal that while LLMs can achieve up to 3x speedup over TorchInductor on individual subgraphs, consistency remains a bottleneck; fine-tuning a small model on PassNet trajectories yields significant improvements.
# PassNet: Scaling Large Language Models for Graph Compiler Pass Generation
Source: [https://arxiv.org/html/2605.29357](https://arxiv.org/html/2605.29357)
Yiqun Liu, Yingsheng Wu, Ruqi Yang, Enrong Zheng, Honglei Qiu, Sijun He, Tai Liang, Jingjing Wu, Yuhan Zhou, Yiwei Zhang, Dongyan Chen, Weihan Yi, Xinqi Li, Siqi Bao

Baidu, Inc.
###### Abstract
Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream models, yet face a systematic performance ceiling on long-tail workloads: our profiling shows that 43% of real-world subgraphs experience end-to-end slowdowns under default compilation. While LLMs offer a path toward automated optimization, existing efforts focus on standalone *kernel generation*. We argue that *pass generation*, where LLMs author structured graph transformations that integrate directly into compiler pipelines, is the more appropriate abstraction.
We propose PassNet, the first large-scale ecosystem for LLM-based compiler pass generation, comprising: (1) PassNet-Dataset, over 18K unique computational graphs from 100K real-world models; and (2) PassBench, 200 curated long-tail fusible tasks (comprising 2,060 subgraphs in total) evaluated under the Error-aware Speedup Score ($ES_t$), a metric unifying correctness, stability, and performance, with layered integrity defenses against systematic LLM exploitation. Experiments reveal that PassBench is both highly discriminative and genuinely unsaturated: the best frontier model trails TorchInductor by 37% in aggregate, yet on individual subgraphs LLMs achieve up to $3\times$ speedup over the same compiler, indicating that the bottleneck is consistency, not capability. Fine-tuning a small model on merely ~4K PassNet trajectories yields a $2.67\times$ improvement approaching frontier-model performance, demonstrating substantial headroom and validating PassNet as live training infrastructure for advancing LLM-driven compiler optimization. All data, benchmarks, and tooling are publicly available.
## 1 Introduction
Modern deep learning systems increasingly rely on tensor compilers (e.g., TVM Chen et al. ([2018](https://arxiv.org/html/2605.29357#bib.bib27)), XLA Kaufman et al. ([2021](https://arxiv.org/html/2605.29357#bib.bib7)), TorchInductor Ansel et al. ([2024](https://arxiv.org/html/2605.29357#bib.bib10))) to lower high-level computational graphs into efficient backend implementations for heterogeneous hardware. These compilers apply expert-designed, rule-based *pass pipelines* (sequences of graph transformations such as operator fusion, tiling, and layout selection) that have proven remarkably effective: on mainstream architectures, TorchInductor delivers up to $2.27\times$ inference speedups over eager execution across 180+ models Ansel et al. ([2024](https://arxiv.org/html/2605.29357#bib.bib10)). Yet these static strategies face a structural limitation when confronting the long tail of real-world operator combinations.
#### The Long-Tail Optimization Gap.
To quantify this limitation, we profile TorchInductor's default pipeline on 9,526 subgraphs extracted from over 1,000 community models. The results reveal a systematic limitation: 34% achieve marginal speedups ($<1.2\times$), 43% experience end-to-end slowdowns, and 8.3% are strictly degraded. This gap is structural: it tracks operator coverage and is essentially uncorrelated with graph complexity ($r=0.013$), suggesting that scaling graph coverage alone cannot overcome the heuristic-induced performance ceiling.
#### Pattern Concentration Creates Leverage.
While the long-tail gap is significant, computation-graph patterns exhibit strong power-law concentration: deduplicating 100K models yields only ~18K distinct graphs (82% redundancy), and ~10,000 subgraphs reduce to ~1,025 unique structural patterns. Generating high-quality passes for this concentrated set suffices to cover most workloads, enabling a shift from *manual rules* to *automated, data-driven pass generation*.
#### From Kernel Generation to Pass Generation.
Large Language Models (LLMs) are a promising approach for long-tail optimization. Existing work has explored Kernel Generation Ouyang et al. ([2025](https://arxiv.org/html/2605.29357#bib.bib35)); Dai et al. ([2026](https://arxiv.org/html/2605.29357#bib.bib37)); Liao et al. ([2025](https://arxiv.org/html/2605.29357#bib.bib46)), producing standalone GPU kernels for individual operators. However, such kernels lack composability with compiler passes, require manual integration for deployment, and are difficult to verify due to unconstrained code generation.
We therefore define a new task, Pass Generation: given a computational subgraph, an LLM must author a structured compiler pass that interfaces directly with the compiler's intermediate representation (formal definition in Section [3](https://arxiv.org/html/2605.29357#S3)). This formulation preserves the "one-line compilation" experience (e.g., `torch.compile`) while enabling composable and verifiable optimizations. However, advancing this task requires large-scale data and rigorous evaluation, resources that do not yet exist.
#### The Data and Evaluation Bottleneck.
Is pass generation with LLMs realistic today? On individual long-tail subgraphs, frontier LLMs can already generate passes achieving up to $3\times$ speedup over the default compiler, yet aggregate performance trails far behind, with no model reaching a geometric-mean speedup above 1.0. This gap stems from two *infrastructure* bottlenecks: (1) *data scarcity*, a lack of large-scale, specialized corpora for fusion and layout optimization; and (2) *evaluation blind spots*, where the absence of rigorous benchmarks allows agents to bypass correctness for illusory gains. To bridge these, we introduce PassNet, the first ecosystem providing systematic data and benchmark support for the pass generation task.
#### Contributions.
We formalize *pass generation*, where LLMs author structured graph transformations that integrate into compiler pipelines, and build a large-scale ecosystem to support this task.
- PassNet-Dataset: We collect 18,086 unique computational graphs from 100K real-world models across diverse frameworks (PyTorch, PaddlePaddle) and task categories. We design *Recursive Folding* and *Execution-driven Prefix Analysis* to construct structurally diverse subgraphs at multiple granularities, forming the first large-scale open training set for pass generation.
- PassBench: We construct a benchmark of 200 tasks, each consisting of a variable number of long-tail subgraphs, evaluated under the *Error-aware Speedup Score* ($ES_t$), which jointly measures correctness, stability, and performance. To ensure evaluation integrity, we introduce layered defenses including AST-based inspection, runtime dispatch interception, and reverse evaluation order, which systematically counter exploitation patterns observed during development.
- Benchmark Validation: Through extensive evaluation of 6 frontier and open-source models, we show that PassBench is both highly discriminative ($3.22\times$ gap between model tiers) and genuinely unsaturated (best model trails TorchInductor by 37%). Fine-tuning on merely ~4K PassNet trajectories yields a $2.67\times$ improvement, validating the dataset's utility as training infrastructure.
## 2 Related Work
#### Tensor Compilers.
Tensor compilers transform high-level computation graphs into device-specific kernels via IR lowering and scheduling. TVM Chen et al. ([2018](https://arxiv.org/html/2605.29357#bib.bib27)) and Ansor Zheng et al. ([2020](https://arxiv.org/html/2605.29357#bib.bib28)) are search-based compilers relying on cost models, while XLA Leary and Wang ([2017](https://arxiv.org/html/2605.29357#bib.bib33)) applies heuristic graph-level optimizations. Recent systems include MetaSchedule Shao et al. ([2022](https://arxiv.org/html/2605.29357#bib.bib11)), Hidet Ding et al. ([2023](https://arxiv.org/html/2605.29357#bib.bib12)), and BladeDISC Zheng et al. ([2023a](https://arxiv.org/html/2605.29357#bib.bib8)), with industry frameworks such as CINN Team ([2021](https://arxiv.org/html/2605.29357#bib.bib5)) and TorchInductor Ansel et al. ([2024](https://arxiv.org/html/2605.29357#bib.bib10)) integrated into production systems. Despite these advances, existing approaches still rely on manual transformation rules and struggle with long-tail workloads.
#### LLMs for Code Generation and Compiler Optimization.
Large Language Models demonstrate strong code generation capabilities Jiang et al. ([2024](https://arxiv.org/html/2605.29357#bib.bib50)); Zheng et al. ([2023b](https://arxiv.org/html/2605.29357#bib.bib49)), including code completion Liu et al. ([2025](https://arxiv.org/html/2605.29357#bib.bib54)), bug repair Huang et al. ([2025](https://arxiv.org/html/2605.29357#bib.bib52)); Yang et al. ([2026](https://arxiv.org/html/2605.29357#bib.bib53)), and software engineering automation Jimenez et al. ([2024](https://arxiv.org/html/2605.29357#bib.bib66)); Yang et al. ([2024](https://arxiv.org/html/2605.29357#bib.bib67)); Wang et al. ([2024](https://arxiv.org/html/2605.29357#bib.bib68)). For compiler tasks, LLM Compiler Cummins et al. ([2025](https://arxiv.org/html/2605.29357#bib.bib58)) pre-trains on compiler IR, Compiler-r1 Pan et al. ([2025](https://arxiv.org/html/2605.29357#bib.bib38)) explores RL-based auto-tuning, and DeCOS Cui et al. ([2025](https://arxiv.org/html/2605.29357#bib.bib57)) studies data-efficient optimization selection. TLP Zhai et al. ([2023](https://arxiv.org/html/2605.29357#bib.bib60)) and follow-up work Zhai et al. ([2024](https://arxiv.org/html/2605.29357#bib.bib39)) apply language models to tensor program generation. However, these approaches focus on pass selection or scheduling rather than synthesizing new transformation logic.
#### LLM-Driven GPU Kernel Generation.
Recent work generates GPU kernels directly with LLMs. KernelBench Ouyang et al. ([2025](https://arxiv.org/html/2605.29357#bib.bib35)) benchmarks LLM-generated kernels, revealing gaps from production compilers. CUDA Agent Dai et al. ([2026](https://arxiv.org/html/2605.29357#bib.bib37)) and Kernelevolve Liao et al. ([2025](https://arxiv.org/html/2605.29357#bib.bib46)) scale agentic generation, while STARK Dong et al. ([2025](https://arxiv.org/html/2605.29357#bib.bib40)), Kevin Baronio et al. ([2025](https://arxiv.org/html/2605.29357#bib.bib48)), and Geak Wang et al. ([2025](https://arxiv.org/html/2605.29357#bib.bib30)) explore multi-agent and RL-based refinement. Other efforts include QiMeng-GEMM Zhou et al. ([2025](https://arxiv.org/html/2605.29357#bib.bib41)), CUDA-L1 Li et al. ([2025](https://arxiv.org/html/2605.29357#bib.bib47)), and Autocomp Hong et al. ([2025](https://arxiv.org/html/2605.29357#bib.bib32)). In contrast, PassNet targets *pass generation*, i.e., structured transformations that integrate with compiler pipelines.
#### Performance Benchmarks.
DL evaluation has evolved from DeepBench Narang and Research ([2016](https://arxiv.org/html/2605.29357#bib.bib14)) to MLPerf Mattson et al. ([2020](https://arxiv.org/html/2605.29357#bib.bib15)). CompilerGym Cummins et al. ([2022](https://arxiv.org/html/2605.29357#bib.bib73)) provides an RL environment for compiler optimization. Datasets such as ComPile Grossman et al. ([2024](https://arxiv.org/html/2605.29357#bib.bib16)), TpuGraphs Phothilimthana et al. ([2023](https://arxiv.org/html/2605.29357#bib.bib18)), and TenSet Zheng et al. ([2021](https://arxiv.org/html/2605.29357#bib.bib19)) provide benchmarks for learned compilers. PassNet focuses on computational-graph-level pass generation for long-tail optimization, with an evaluation framework that jointly measures correctness, stability, and speedup.
## 3 The PassNet Ecosystem
The PassNet ecosystem bridges the dual infrastructure gaps identified in Section [1](https://arxiv.org/html/2605.29357#S1): the scarcity of large-scale pass generation corpora and the lack of robust, multi-dimensional benchmarks. In the following, we formalize the pass generation task before detailing each ecosystem component.
### 3.1 Task Formulation
In modern tensor compilers, a *pass* is a self-contained graph transformation that rewrites a computational graph while preserving its input–output semantics Lattner et al. ([2021](https://arxiv.org/html/2605.29357#bib.bib29)); Li et al. ([2021](https://arxiv.org/html/2605.29357#bib.bib22)). Following the pattern-based rewriting paradigm adopted by MLIR Lattner et al. ([2021](https://arxiv.org/html/2605.29357#bib.bib29)) and TorchInductor Ansel et al. ([2024](https://arxiv.org/html/2605.29357#bib.bib10)), we formalize the core abstractions below.
###### Definition 3.1 (Computational Graph).
A computational graph is a DAG $G=(V,E,\tau,\sigma)$, where $V$ is a set of operator nodes, $E$ encodes data dependencies, $\tau:V\to\mathcal{T}$ assigns operator types, and $\sigma:V\to\mathbb{Z}^{+}$ assigns output shapes. We write $f_G:\mathcal{X}\to\mathcal{Y}$ for the function computed by $G$.
###### Definition 3.2 (Compiler Pass).
A compiler pass is a pair $\pi=(M,R)$, where $M:\mathcal{G}\to 2^{\mathcal{G}}$ is a pattern matcher identifying optimization-eligible subgraphs, and $R:\mathcal{G}\to\mathcal{G}$ is a rewriter replacing each matched subgraph with an optimized equivalent. A pass $\pi$ is valid on $G$ under tolerance $t$ if:

$$\forall x\in\mathcal{X},\quad \mathrm{err}\bigl(f_G(x),\; f_{\pi(G)}(x)\bigr)\le t \tag{1}$$

The pass generation task is: given a *task instance* $\mathcal{T}=\{G_1,\ldots,G_k\}$ of subgraphs sharing the same operator-type sequence but varying in shape and dtype, generate a valid pass $\pi$ that rewrites every $G_i\in\mathcal{T}$ and improves aggregate runtime performance. This multi-graph formulation requires $\pi$ to generalize across varying shapes and data types, precluding shape-specific hacks. As opposed to free-form kernel generation, it further ensures composability with existing compiler pipelines and verifiability through standard compiler infrastructure.
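Equation (1) quantifies over every input in $\mathcal{X}$, which cannot be enumerated in practice. A minimal sketch of a sampled stand-in is shown below, assuming graphs are modelled as plain Python callables over lists of floats and that $\mathrm{err}$ is the maximum absolute element-wise difference. The helper names `max_abs_err` and `is_valid_on_samples` are hypothetical and are not part of PassNet.

```python
from typing import Callable, Iterable, Sequence

Vector = Sequence[float]


def max_abs_err(y_ref: Vector, y_new: Vector) -> float:
    """One possible err(., .): largest element-wise absolute difference (assumption)."""
    if len(y_ref) != len(y_new):
        return float("inf")  # a shape mismatch can never be within tolerance
    return max((abs(a - b) for a, b in zip(y_ref, y_new)), default=0.0)


def is_valid_on_samples(
    f_ref: Callable[[Vector], Vector],
    f_rewritten: Callable[[Vector], Vector],
    samples: Iterable[Vector],
    tol: float,
) -> bool:
    """Sampled approximation of Eq. (1): every sampled input must stay within tol."""
    return all(max_abs_err(f_ref(x), f_rewritten(x)) <= tol for x in samples)


# Toy stand-ins for f_G and f_{pi(G)}: y = 2x + 1, element-wise.
def f_g(x: Vector) -> Vector:
    return [v * 2.0 + 1.0 for v in x]


def f_fused(x: Vector) -> Vector:      # semantics-preserving rewrite
    return [2.0 * v + 1.0 for v in x]


def f_broken(x: Vector) -> Vector:     # rewrite that silently drops the bias
    return [2.0 * v for v in x]


# Inputs of different lengths mimic the shape variation within a task instance.
samples = [[0.0, 1.0], [-3.5, 2.25, 8.0]]
print(is_valid_on_samples(f_g, f_fused, samples, tol=1e-6))
print(is_valid_on_samples(f_g, f_broken, samples, tol=1e-6))
```

Sampling can only refute validity, never prove it, which is one reason a benchmark needs additional integrity checks beyond output comparison.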
### 3.2 Dataset Construction
The construction of the dataset consists of two stages: computational graph collection and advanced subgraph generation, as illustrated in Figure [1](https://arxiv.org/html/2605.29357#S3.F1). The overall goal is to preserve real-world computation patterns while constructing structurally diverse and optimization-relevant subgraphs.

Figure 1: PassNet Dataset Construction Pipeline. We collect and filter real-world model graphs, and generate hierarchical training subgraphs via recursive graph splitting and metadata generalization.

#### Graph Collection and Validation.
We extract computational graphs from real-world models via a lightweight decorator (`pass_net.extract`). During execution, symbolic tracing captures operator invocations and tensor dependencies, producing standardized representations (high-level IR, weights, and input metadata). To ensure high-fidelity graphs for downstream subgraph construction, each graph is rigorously validated against five key constraints: runnable, serializable, decomposable, statically analyzable, and custom-operator accessible (see Appendix [A](https://arxiv.org/html/2605.29357#A1)).
#### Subgraph Generation Strategies.
To systematically map the optimization space, we propose three subgraph categories focusing on structural recurrence, fusion potential, and primitive behavior.
(1) Classical Subgraph Selection via Recursive Folding. A Classical Subgraph is a recurrent structural motif representing common computational patterns within a model family. To extract such motifs at scale, we design Recursive Folding: the method first linearizes the computation graph into a topological operator sequence, then iteratively applies convolution-based hashing to identify frequent subsequences and abstract them into symbolic units. This hierarchical process captures both local idioms (e.g., [Conv2d, BatchNorm] $\to\alpha$) and higher-level compositions (e.g., [$\alpha$, ReLU] $\to\beta$), producing a compact set of representative structures (Figure [2(a)](https://arxiv.org/html/2605.29357#S3.F2.sf1)).
(2) Fusible Subgraph Discovery via Prefix Analysis. A Fusible Subgraph is a contiguous segment of a computational graph where a compiler can apply operator fusion. To identify such regions systematically, we design Prefix Analysis, which analyzes the prefix kernel-count curve $(P, K(P))$, where $K(P)$ is the number of kernels launched by the first $P$ operators, and detects plateaus satisfying $K(P+1)=K(P)$. These indicate that operators are absorbed into existing execution units. Extracting contiguous plateaus yields subgraphs reflecting compiler fusion behavior (Figure [2(b)](https://arxiv.org/html/2605.29357#S3.F2.sf2)).
(3) Single-operator Subgraph. A Single-operator Subgraph consists of a primitive operator and captures operator-level behavior, complementing higher-level structures.
We apply shape and data type generalization during subgraph instantiation, generating instances with 10 shape configurations and 3 data types. This introduces variation in computation and optimization difficulty, improving applicability across hardware backends and compiler strategies.
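The plateau condition $K(P+1)=K(P)$ used by Prefix Analysis can be illustrated with a short sketch. It assumes the prefix kernel counts have already been measured and stored in a list; how $K(P)$ is obtained from the compiler, and how plateaus are mapped back to operator ranges, are not specified here. The function name `find_plateaus` and the sample curve are hypothetical.

```python
from typing import List, Tuple


def find_plateaus(kernel_counts: List[int]) -> List[Tuple[int, int]]:
    """Return maximal runs where K(P+1) == K(P).

    kernel_counts[P - 1] holds K(P), the number of kernels launched by the
    first P operators. Each result is (first_P, last_P), 1-indexed and
    inclusive, and spans at least two prefixes.
    """
    plateaus: List[Tuple[int, int]] = []
    n = len(kernel_counts)
    start = 0  # 0-based index where the current run of equal counts begins
    for i in range(1, n + 1):
        # Close the run at the end of the list or when the count changes.
        if i == n or kernel_counts[i] != kernel_counts[start]:
            if i - start >= 2:
                plateaus.append((start + 1, i))
            start = i
    return plateaus


# Made-up curve: operators 2-3 and operator 6 launch no additional kernel.
curve = [1, 1, 1, 2, 3, 3, 4]
print(find_plateaus(curve))
```

On this made-up curve the function reports two plateaus, covering prefixes 1 to 3 and 5 to 6, which correspond to operators that were absorbed into an already-launched kernel.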
(a) Recursive folding. [Conv2d, BatchNorm] $\to\alpha$; [$\alpha$, ReLU] $\to\beta$.
(b) Prefix kernel-count curve (ResNet-18). Plateau regions indicate fusible intervals.
Figure 2: Recursive folding for subgraph selection (left) and prefix-based fusibility analysis (right).
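The folding in Figure 2(a) can be sketched in a few lines. This is a deliberately simplified stand-in: the paper describes convolution-based hashing over subsequences, whereas the sketch below counts adjacent operator pairs directly (similar in spirit to byte-pair encoding). The names `fold_once` and `recursive_fold`, the `min_count` threshold, and the `U0, U1, ...` symbols (playing the role of $\alpha$, $\beta$) are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def fold_once(seq: List[str], pair: Tuple[str, str], symbol: str) -> List[str]:
    """Replace non-overlapping occurrences of `pair` in `seq` with `symbol`."""
    out: List[str] = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def recursive_fold(
    seq: List[str], min_count: int = 2
) -> Tuple[List[str], Dict[str, Tuple[str, str]]]:
    """Fold the most frequent adjacent pair until none occurs min_count times."""
    rules: Dict[str, Tuple[str, str]] = {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]  # ties: first pair encountered
        if count < min_count:
            break
        symbol = f"U{len(rules)}"
        rules[symbol] = pair
        seq = fold_once(seq, pair, symbol)
    return seq, rules


# Linearized operator sequence of a toy model with three identical blocks.
ops = ["Conv2d", "BatchNorm", "ReLU"] * 3 + ["Linear"]
folded, rules = recursive_fold(ops)
print(folded)
print(rules)
```

In this toy run the first rule folds [Conv2d, BatchNorm] and the second folds that unit with ReLU, mirroring the $\alpha$ and $\beta$ units in the figure; a third rule then groups two repeated blocks.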
### 3.3 Dataset Characteristics
The resulting PassNet dataset comprises over 18K unique computational graphs derived from 100K diverse models across PyTorch and PaddlePaddle, with four distinguishing properties:
(i) Authenticity and Scale: All samples originate from production-grade community libraries rather than synthetic generators. With 100K source models yielding 18,086 deduplicated graphs (82% redundancy), PassNet captures the true distribution of real-world computation patterns, including long-tail operator combinations absent from curated benchmarks.
(ii) Structural Diversity: The collection spans six application domains (NLP: 63.6%, CV: 27.0%, Multimodal: 1.7%, Audio: 1.2%, Others: 6.5%), with model scales from lightweight mobile architectures to 10B-parameter models and node counts from 2 to 298,441 (median $\sim 2^{9}$).
(iii) Optimization-Relevant Coverage: We instantiate 129K fusible ($N_{\text{ops}}\in[2,35]$), 126K classical ($N_{\text{ops}}\in[4,62]$), and 24K single-operator subgraphs, totaling ~279K instances across three complementary granularities, each augmented with 10 shape configurations and 3 data types.
(iv) Interoperability: A unified format for graphs, metadata, and custom operators ensures compatibility with compilers such as TorchInductor, CINN, XLA, and TVM without conversion overhead.
### 3.4 PassBench Design
While the full dataset serves as training data, rigorous evaluation requires a controlled benchmark with diverse, representative tasks. We curate PassBench via multi-dimensional bucketing and hierarchical grouping of subgraphs, with each group forming a task instance. The benchmark primarily focuses on fusible-subgraph tasks, comprising 4,476 training samples and 200 high-quality evaluation samples, with no overlap between training and evaluation sets. The dataset is further augmented with 4,078 classical-subgraph training samples and 200 evaluation counterparts, complemented by 1,029 single-operator samples. This multi-tiered composition provides varying levels of task difficulty.
#### Selection Pipeline.
Subgraphs are bucketed along three dimensions: operator sequence (exact-match), input shape (log-quantized), and input dtype. Within each operator-sequence bucket, we apply stratified sampling with fixed stride $\sigma$, followed by cross-shape and dtype-aware aggregation to form groups corresponding to PassBench tasks (details in Appendix [B](https://arxiv.org/html/2605.29357#A2)). For evaluation, we select 200 operator sequences via a Hidden Markov Model and retain the largest group per sequence. Selected fusible samples contain 1–396 subgraphs (avg. = 10), exhibiting a long-tail distribution, and yield 2,060 subgraph-level evaluations in aggregate, providing finer-grained signal than benchmarks that evaluate at the task level alone (e.g., KernelBench, where each task corresponds to a single kernel).
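The three-way bucketing can be sketched as follows. This is an illustration under stated assumptions: the record fields (`name`, `ops`, `shape`, `dtype`) are hypothetical, and base-2 quantization per dimension is assumed because the text only says "log-quantized". Stratified sampling and the HMM-based selection are not modelled.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

BucketKey = Tuple[Tuple[str, ...], Tuple[int, ...], str]


def log_quantize(shape: Tuple[int, ...]) -> Tuple[int, ...]:
    """Map each dimension d >= 1 to floor(log2(d)) using exact integer math."""
    return tuple(d.bit_length() - 1 for d in shape)


def bucket_subgraphs(subgraphs: List[dict]) -> Dict[BucketKey, List[str]]:
    """Group subgraph names by (operator sequence, quantized shape, dtype)."""
    buckets: Dict[BucketKey, List[str]] = defaultdict(list)
    for sg in subgraphs:
        key = (tuple(sg["ops"]), log_quantize(sg["shape"]), sg["dtype"])
        buckets[key].append(sg["name"])
    return dict(buckets)


# Made-up records; field names are illustrative only.
subgraphs = [
    {"name": "a", "ops": ["add", "relu"], "shape": (1000, 64), "dtype": "float32"},
    {"name": "b", "ops": ["add", "relu"], "shape": (600, 100), "dtype": "float32"},
    {"name": "c", "ops": ["add", "relu"], "shape": (1000, 64), "dtype": "float16"},
    {"name": "d", "ops": ["mul", "relu"], "shape": (1000, 64), "dtype": "float32"},
]
buckets = bucket_subgraphs(subgraphs)
for key, names in buckets.items():
    print(key, names)
```

Subgraphs `a` and `b` share a bucket because 1000 and 600 both fall in $[2^9, 2^{10})$ and 64 and 100 both fall in $[2^6, 2^7)$, while a different dtype or operator sequence opens a new bucket.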
#### Task Format.
Each task is packaged as a directory containing (i) a Python reference implementation (`GraphModule`), (ii) tensor metadata, and (iii) runtime metadata. The agent is required to generate executable transformation passes under `pass_dir/` along with a JSON manifest. A submission is considered successful if it preserves correctness across all specified data types while achieving measurable performance improvement.
### 3.5 Error-aware Speedup Metrics
Evaluating compiler passes requires assessing both correctness and performance. Existing approaches fall short in three aspects: (1) treating correctness as binary (pass/fail) despite its continuous tolerance; (2) producing discrete signals that hinder iterative agent training; and (3) operating at the benchmark level, lacking fine-grained feedback on individual graphs.
We define a per-subgraph metric to jointly evaluate correctness and performance. For each subgraph $i$ with measured speedup $s_i$, we introduce a tolerance threshold $t\in\{-10,\dots,|E|+1\}$ to control acceptance of error categories $c_i\in\{1,2,3\}$ (accuracy, compilation, and runtime failures). For $t\le 0$, strict correctness is enforced with varying numerical tolerances; for $t>0$, more error categories are progressively forgiven. Let $\mathrm{correct}_{t,i}$ be a binary indicator of whether subgraph $i$ satisfies the correctness criteria under threshold $t$. The error-aware rectified speedup is defined as:

$$\hat{s}_{t,i}=\begin{cases} s_i, & \mathrm{correct}_{t,i}\land s_i\ge 1,\\[3pt] s_i^{p+1}, & \mathrm{correct}_{t,i}\land s_i<1,\\[3pt] b^{\mathbf{1}(t<c_i)}, & \text{otherwise,}\end{cases} \tag{2}$$

where parameters $p,b\in(0,1)$ respectively govern the exponential penalty for slowdowns and the base penalty for incorrect executions. The metric distinguishes three distinct scenarios: (i) Speedup ($s_i\ge 1$): retained for correct executions; (ii) Slowdown ($s_i<1$): exponentially penalized via $p$; (iii) Incorrect: assigned penalty $b$ if $t<c_i$, or $1$ if forgiven ($t\ge c_i$).
The Error-aware Speedup Score $ES_t$ is defined as the geometric mean of $\{\hat{s}_{t,i}\}$ across all $N$ subgraphs, with its equivalent factored form and macro-level derivation detailed in Appendix [D](https://arxiv.org/html/2605.29357#A4):

$$ES_t=\Bigl(\prod_{i=1}^{N}\hat{s}_{t,i}\Bigr)^{1/N}. \tag{3}$$
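Equations (2) and (3) translate directly into code. The sketch below is a transcription under stated assumptions, not the PassBench evaluator: the per-subgraph correctness flag is taken as already computed for the given threshold $t$, speedups are assumed strictly positive, and the record layout and function names are hypothetical. The example uses $b=0.1$ and $p=0$, the values reported for the experiments.

```python
import math
from typing import List, Optional, Tuple

# (speedup s_i, correct_{t,i}, error category c_i or None when correct)
Result = Tuple[float, bool, Optional[int]]


def rectified_speedup(
    speedup: float, correct: bool, error_cat: Optional[int],
    t: int, p: float, b: float,
) -> float:
    """Eq. (2): error-aware rectified speedup for one subgraph."""
    if correct:
        # Speedups are kept as-is; slowdowns are penalized via exponent p + 1.
        return speedup if speedup >= 1.0 else speedup ** (p + 1.0)
    # Incorrect: penalty b, unless the error category is forgiven (t >= c_i).
    return b if t < error_cat else 1.0


def es_t(results: List[Result], t: int, p: float, b: float) -> float:
    """Eq. (3): geometric mean of rectified speedups, computed in log space."""
    logs = [
        math.log(rectified_speedup(s, ok, c, t, p, b)) for s, ok, c in results
    ]
    return math.exp(sum(logs) / len(logs))


# Made-up task: one speedup, one slowdown, one compilation failure (c_i = 2).
results: List[Result] = [(1.5, True, None), (0.8, True, None), (1.0, False, 2)]
strict = es_t(results, t=0, p=0.0, b=0.1)   # failure counted as b
relaxed = es_t(results, t=2, p=0.0, b=0.1)  # failure forgiven, counted as 1
print(strict, relaxed)
```

With these made-up inputs the strict score is the cube root of $1.5 \times 0.8 \times 0.1$ and the relaxed score is the cube root of $1.5 \times 0.8 \times 1$, showing how a single failure dominates the geometric mean until it is forgiven.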
We further aggregate $ES_t$ across the tolerance spectrum via a normalized geometric mean to yield a unified scalar for agent feedback:

$$\mathrm{AS}=\prod_{t=-10}^{|E|+1} ES_t^{\,W_t/\sum_{s=-10}^{|E|+1}W_s}, \tag{4}$$

where $W_t$ assigns high weight to the strict-correctness regime ($t\in[-5,-3]$) and decays exponentially toward relaxed tolerances (details in Appendix [E](https://arxiv.org/html/2605.29357#A5)). While $fast_p$ (fraction of correct graphs with speedup $>p$) provides a binary correctness threshold, its discrete nature lacks a smooth signal and operates at benchmark-level granularity, failing to capture per-graph variation. AS instead provides a smooth, continuous feedback signal that jointly reflects correctness and performance gains.
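Equation (4) is a weighted geometric mean over thresholds, which is easiest to compute in log space. The sketch below assumes the $ES_t$ values are already available as a mapping from $t$ to score. The weights in the example are placeholders chosen for illustration; the actual $W_t$ schedule is defined in the paper's appendix and is not reproduced here. The function name `aggregate_score` is hypothetical.

```python
import math
from typing import Dict


def aggregate_score(es_by_t: Dict[int, float], weights: Dict[int, float]) -> float:
    """Eq. (4): AS = prod_t ES_t ** (W_t / sum_s W_s), computed in log space."""
    total_weight = sum(weights[t] for t in es_by_t)
    log_as = sum(weights[t] * math.log(es_by_t[t]) for t in es_by_t) / total_weight
    return math.exp(log_as)


# Two thresholds only, with made-up ES_t values and placeholder weights.
es_values = {-1: 0.25, 0: 1.0}
equal = aggregate_score(es_values, {-1: 1.0, 0: 1.0})       # plain geometric mean
strict_heavy = aggregate_score(es_values, {-1: 3.0, 0: 1.0})  # favors t = -1
print(equal, strict_heavy)
```

With equal weights the result is the ordinary geometric mean of the two scores; putting more weight on the stricter threshold pulls AS toward the lower strict-regime score, which is the intended effect of concentrating $W_t$ on strict correctness.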
### 3.6 Evaluation Integrity
A key challenge overlooked by prior work is that LLMs systematically exploit evaluation loopholes. In KernelBench-style evaluations, a submitted kernel passes as long as its output matches the reference, even if it simply delegates to `torch.compile`. During PassBench development, we found that 29%–50% of frontier-model submissions contained some form of exploitation. We document a three-stage arms race, where each defense exposed a new attack surface.
#### Case A: Computation Delegation $\to$ AST Inspection.
The most prevalent pattern is invoking high-level APIs to bypass explicit operator-level transformations, e.g., calling `torch.matmul(in_1, in_3)` instead of implementing the fusion logic. We counter this with AST-based static analysis that blocks forbidden API calls within non-exempt functions (raising `RuntimeError: blocked call`), intercepting 78% of violations.
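A static check of this kind can be sketched with Python's standard `ast` module. The code below is an illustration, not the PassBench checker: the blocklist, the exempt function name `pattern`, the function name `replacement`, and the restriction to top-level functions are all assumptions made for the example. Only the error message prefix follows the text above.

```python
import ast
from typing import List, Set


def dotted_name(node: ast.AST) -> str:
    """Return 'a.b.c' for Name/Attribute chains, or '' for anything else."""
    parts: List[str] = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return ""


def find_blocked_calls(source: str, blocked: Set[str], exempt: Set[str]) -> List[str]:
    """List 'function:line:call' for blocked calls in non-exempt top-level functions."""
    hits: List[str] = []
    for func in ast.parse(source).body:
        if not isinstance(func, ast.FunctionDef) or func.name in exempt:
            continue
        for node in ast.walk(func):
            if isinstance(node, ast.Call) and dotted_name(node.func) in blocked:
                hits.append(f"{func.name}:{node.lineno}:{dotted_name(node.func)}")
    return hits


def check_submission(source: str, blocked: Set[str], exempt: Set[str]) -> None:
    """Raise RuntimeError if any blocked call is found."""
    hits = find_blocked_calls(source, blocked, exempt)
    if hits:
        raise RuntimeError("blocked call: " + ", ".join(hits))


# The submission is only parsed, never executed, so torch need not be installed.
SOURCE = (
    "def pattern(a, b):\n"
    "    return torch.matmul(a, b)\n"
    "\n"
    "def replacement(a, b):\n"
    "    return torch.matmul(a, b)\n"
)
print(find_blocked_calls(SOURCE, {"torch.matmul"}, {"pattern"}))
```

Because the check matches syntactic call names, aliasing (for example `mm = torch.matmul`) or operator overloading slips past it, which is consistent with the need for the runtime layer described in Case B.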
#### Case B: Dynamic Evasion $\to$ Dispatch Interception.
Static analysis cannot cover dynamic execution paths such as implicit tensor method calls (e.g., `tmp = in_0 + in_1` dispatching to `aten.add.Tensor`). We introduce a runtime monitoring layer via `PoisonDispatchTensor`, which overrides `__torch_dispatch__` to enforce whitelist-based operator filtering on the mandatory dispatch path. This layer is the only one to catch a further 18% of violations, all of which AST inspection misses.
#### Case C: Cache Pollution $\to$ Reverse Evaluation.
We discovered a *correctness escape*: in the conventional "eager-first" evaluation order, PyTorch's memory pooling leaves residual data in GPU memory, so flawed code (e.g., `return torch.empty(...)`) can inadvertently pass validation against it. We adopt "reverse evaluation" (compiled execution before the eager baseline) to ensure verification within a pristine system state; such cases now receive correctness $=0$.
These layered defenses are complemented by the *pass-form mandate* itself: requiring agents to author a pattern matcher and rewriter (rather than a standalone kernel) forces structural understanding of the computation graph, making trivial exploitation significantly harder.
## 4 Experiments
We design experiments to validate PassBench as a benchmark and PassNet-Dataset as training infrastructure: (Q1) Does PassBench provide meaningful discrimination across model capabilities, and how do current models compare to traditional compilers? (Q2) Can PassNet-Dataset improve model performance via post-training, validating its utility as training data?
### 4.1 Setup
#### PassAgent.
To establish reproducible baselines, we implement PassAgent, a lightweight agentic scaffold for compiler pass synthesis. Following the dual-tool paradigm Yang et al. ([2024](https://arxiv.org/html/2605.29357#bib.bib67)); Wang et al. ([2024](https://arxiv.org/html/2605.29357#bib.bib68)); Jain et al. ([2025](https://arxiv.org/html/2605.29357#bib.bib70)); Anthropic ([2024](https://arxiv.org/html/2605.29357#bib.bib69)), PassAgent provides: (1) `file_editor` for multi-file workspace manipulation, and (2) `pass_evaluator` for invoking the PassBench evaluation pipeline with three-stage diagnostics (pass matching $\to$ correctness $\to$ performance). The agent iteratively inspects target graphs, edits pass files, and refines based on AS Score feedback until convergence. All experiments in this paper are conducted on the fusible-subgraph tasks, using the $ES_t$ formulation with $b=0.1$ and $p=0$. Full design details are in Appendix [C](https://arxiv.org/html/2605.29357#A3).
#### Models and Baselines.
We evaluate frontier models (GPT-5.4, Claude-Opus-4.6, Claude-Sonnet-4.6) and open-source models (GLM-5.1 GLM-5-Team et al. ([2026](https://arxiv.org/html/2605.29357#bib.bib44)), MiniMax-M2.7 MiniMaxAI ([2026](https://arxiv.org/html/2605.29357#bib.bib43)), Qwen3-30B-A3B and Qwen3-4B Qwen Team ([2025](https://arxiv.org/html/2605.29357#bib.bib42))). Baselines include Eager execution and TorchInductor (`torch.compile`, default mode). We prefer the default mode over `max-autotune` because, at this scale, CUDA Graph layout freezing and its overheads outweigh the gains. Hardware details are in Appendix [C](https://arxiv.org/html/2605.29357#A3).
### 4.2 Main Results

| Model | $fast_1$ (%)\* | Samp. CR (%)† | Sub. CR (%)‡ | G-Mean Speedup§ | AS Score |
|---|---|---|---|---|---|
| Eager | 100.0 | 100.0 | 100.0 | 1.000 | 1.000 |
| Inductor | 20.3 | 64.5 | 85.0 | 0.846 | 0.706 |
| GPT-5.4 | 7.4 | 47.0 | 54.6 | 0.821 | 0.410 |
| Claude-Opus-4.6 | 9.8 | 40.0 | 48.6 | 0.922 | 0.410 |
| Claude-Sonnet-4.6 | 9.2 | 52.0 | 61.9 | 0.835 | 0.448 |
| GLM-5.1 | 4.8 | 30.5 | 33.5 | 0.844 | 0.240 |
| MiniMax-M2.7 | 1.0 | 23.5 | 23.0 | 0.653 | 0.208 |
| Qwen3-4B | 0.2 | 1.0 | 1.3 | 0.953 | 0.108 |
| Qwen3-30B-A3B | 0.6 | 7.5 | 11.8 | 0.693 | 0.139 |
| Qwen3-4B-SFT¶ | 3.0 | 27.5 | 28.5 | 0.808 | 0.240 |
| Qwen3-30B-A3B-SFT# | 5.3 | 44.0 | 48.8 | 0.809 | 0.371 |

Table 1: Main Results on PassBench. \*Fraction of correct subgraphs with speedup $\geq 1.0$ over eager. †Fraction of correct samples. ‡Fraction of correct subgraphs. §Geometric-mean speedup over correct subgraphs. ¶Qwen3-4B fine-tuned on PassNet. #Qwen3-30B-A3B fine-tuned on PassNet.

Table [1](https://arxiv.org/html/2605.29357#S4.T1) presents results across five metrics (definitions in caption); we highlight four key findings.
(1) PassBench effectively discriminates model capabilities. The benchmark shows a clear performance gap: Claude-Sonnet-4.6 (AS = 0.448) outperforms Qwen3-30B-A3B (AS = 0.139) by $3.22\times$. Correctness ratios exhibit a similar spread (Sub. CR: 61.9% vs. 11.8%). This separation indicates that PassBench provides a meaningful discriminative signal across models.
(2) All models fall substantially short of the compiler baseline. All models have G-Mean Speedup below 1.0, meaning generated code is *slower* than eager execution. The best frontier model (Claude-Opus-4.6, G-Mean = 0.922) approaches but does not reach parity, and the highest AS score (Claude-Sonnet-4.6, 0.448) trails Inductor (0.706) by 37%. Even for correct outputs, speedups rarely exceed $1.2\times$, indicating limited hardware-cost awareness in current LLM-generated kernels.
(3) Low aggregate scores coexist with striking individual successes. No model reaches AS $\geq 0.5$ or G-Mean $\geq 1.0$, yet on specific long-tail subgraphs, frontier models deliver up to $3.02\times$ speedup over Inductor (Section [4.4](https://arxiv.org/html/2605.29357#S4.SS4)). The contrast between strong per-instance potential and weak aggregate performance reveals that the challenge is *consistency*: models fail to reliably generalize sparse successes across diverse patterns. Moreover, persistent failure modes including boundary misalignment, cost-model blindness, and semantic disruption suggest an unsaturated benchmark. Improving data and training infrastructure, rather than mere scaling, is key to closing this gap.
(4) Iteration reveals capabilities missed by single-shot evaluation. All main results are reported at convergence (50 iterations). As shown in Figure [3](https://arxiv.org/html/2605.29357#S4.F3), a single evaluation captures only 31%–51% of each agent's best AS score (mean 38%), and 12%–52% of eventually-passing samples exhibit non-monotonic pass→fail→pass trajectories as agents explore different generalization strategies. This differs from KernelBench, where iteration monotonically refines a single kernel.
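To make the notion of a non-monotonic trajectory concrete, the sketch below flags a pass→fail→pass pattern in a sequence of per-iteration outcomes and tracks the best score seen so far. The representation (a list of booleans and a list of scores per sample) is an assumption for illustration; the paper does not specify how PassAgent logs iterations.

```python
def has_pass_fail_pass(outcomes):
    """True if some passing iteration is later followed by a failure and then another pass."""
    seen_pass = False
    seen_fail_after_pass = False
    for passed in outcomes:
        if passed:
            if seen_fail_after_pass:
                return True
            seen_pass = True
        elif seen_pass:
            seen_fail_after_pass = True
    return False


def best_so_far(scores):
    """Running maximum: the score an agent would report if stopped after each iteration."""
    best, current = [], float("-inf")
    for score in scores:
        current = max(current, score)
        best.append(current)
    return best


if __name__ == "__main__":
    print(has_pass_fail_pass([False, True, False, True]))
    print(best_so_far([0.1, 0.3, 0.2, 0.4]))
```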
Figure 3: Agent performance scales with iteration budget. AS Score as a function of iteration steps (up to 50).
### 4.3 Dataset Efficacy via Distillation
The main results show a large performance gap between frontier models and open-source models. To address Q2, we distill expert trajectories into smaller models and evaluate the resulting performance gains.
#### Setup\.
We generate PassAgent trajectories from 4,476 samples using Claude-Sonnet-4.6 (two trials per instance, up to 50 iterations), retaining those with AS > 0.1 to obtain 3,899 training trajectories. We then fine-tune Qwen3-30B-A3B and Qwen3-4B with learning rate $2\times10^{-5}$ (cosine decay to $2\times10^{-6}$), batch size 8, 5 epochs, and 256K context length (full setup in Appendix [C](https://arxiv.org/html/2605.29357#A3)).
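The learning-rate schedule described above can be sketched in a few lines. This is a minimal illustration that assumes a plain cosine curve with no warmup and an arbitrary step count; the function name and the absence of warmup are assumptions, and the actual training configuration is the one given in Appendix C.

```python
import math

LR_MAX = 2e-5  # initial learning rate from the setup above
LR_MIN = 2e-6  # floor reached at the end of the cosine decay


def cosine_lr(step: int, total_steps: int, lr_max: float = LR_MAX, lr_min: float = LR_MIN) -> float:
    """Cosine decay from lr_max at step 0 to lr_min at total_steps (no warmup assumed)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 250, 500, 750, 1000):
        print(step, f"{cosine_lr(step, 1000):.2e}")
```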
#### Results\.
Table [1](https://arxiv.org/html/2605.29357#S4.T1) shows that Qwen3-30B-A3B-SFT reaches an AS of 0.371, a $2.67\times$ gain over the base model (0.139), approaching frontier-model performance (0.410). Sub. CR and Samp. CR rise to 48.8% and 44.0% from 11.8% and 7.5%, respectively. Similar gains for Qwen3-4B-SFT, alongside scaling-driven improvements in the 30B variant, collectively validate our training dataset.
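For reference, the gains can be recomputed from the AS column of Table 1 (the 4B ratio is derived here from the table and is not quoted in the text):

$$
\frac{\mathrm{AS}_{\text{30B-SFT}}}{\mathrm{AS}_{\text{30B}}}=\frac{0.371}{0.139}\approx 2.67,
\qquad
\frac{\mathrm{AS}_{\text{4B-SFT}}}{\mathrm{AS}_{\text{4B}}}=\frac{0.240}{0.108}\approx 2.22,
\qquad
\frac{\mathrm{AS}_{\text{30B-SFT}}}{\mathrm{AS}_{\text{frontier}}}=\frac{0.371}{0.410}\approx 0.90 .
$$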
This result, achieved with only $\sim$4K trajectories from a single teacher, suggests substantial headroom through scaling data collection, multiple teachers, and RL from $ES_t$ feedback.
### 4.4 Case Studies
We analyze representative successes and failures to understand the capabilities and limitations of current LLMs. Unlike kernel-centric approaches, pass generation enables LLMs to discover *graph-level rewrite rules*, i.e., pattern matchers paired with fused replacements that generalize across samples. Full implementations are provided in Appendix [H](https://arxiv.org/html/2605.29357#A8).
#### Success Case 1: Roll\+Slice Fusion \(MaskFormer\)\.
TorchInductor decomposes `roll` into multiple `slice`+`cat` ops, launching 6 kernels for an 8-operator subgraph. The LLM recognizes that `roll(shift=3)` + `slice[:128]` is equivalent to index arithmetic ($\text{idx}=(S+i-\text{shift})\bmod S$), replacing the entire subgraph with a single fused kernel ($3.02\times$ speedup).
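The equivalence behind this rewrite can be checked on plain Python lists. The sketch below is not the generated kernel (that is in Appendix H); it only demonstrates, for a 1-D sequence, that rolling and then slicing yields the same elements as a single gather with the index formula above. The function names are illustrative.

```python
def roll(xs, shift):
    """Circular right shift of a 1-D list, so that out[i] == xs[(i - shift) % S]."""
    if not xs:
        return []
    shift %= len(xs)
    return xs[-shift:] + xs[:-shift] if shift else list(xs)


def roll_then_slice(xs, shift, k):
    """Unfused reference: materialize the rolled sequence, then keep the first k elements."""
    return roll(xs, shift)[:k]


def fused_roll_slice(xs, shift, k):
    """Fused form: gather each of the k outputs directly with idx = (S + i - shift) mod S."""
    s = len(xs)
    return [xs[(s + i - shift) % s] for i in range(k)]


if __name__ == "__main__":
    data = list(range(200))
    assert roll_then_slice(data, 3, 128) == fused_roll_slice(data, 3, 128)
    print(fused_roll_slice(data, 3, 4))
```

The fused form never builds the intermediate rolled sequence, which is the property that lets a single kernel replace the `slice`+`cat` decomposition.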
#### Success Case 2: Masked Mean Pooling \(BGE\-Reranker\)\.
TorchInductor fails to fuse a 7-op chain (cast → mul → sum → sum → clamp → div → cat), yielding a $\sim$50% slowdown vs. eager. The LLM identifies the masked mean pooling semantics and generates a single kernel that accumulates $\sum(\text{mask}\cdot\text{hidden})$ and $\sum(\text{mask})$ in FP32 registers ($2.90\times$ speedup, bitwise-identical).
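A reference version of the fused computation, written over plain Python lists, makes the single-pass structure explicit: both sums are accumulated in one loop and the division happens once at the end. The clamp floor `min_count=1e-9` is an assumed value (the paper does not state it), and the trailing `cat` is omitted.

```python
def masked_mean_pool(hidden, mask, min_count=1e-9):
    """Mean of the rows of `hidden` whose mask entry is non-zero.

    hidden: list of seq_len rows, each a list of `dim` floats.
    mask:   list of seq_len 0/1 values.
    """
    if len(hidden) != len(mask):
        raise ValueError("hidden and mask must have the same sequence length")
    dim = len(hidden[0]) if hidden else 0
    weighted_sum = [0.0] * dim  # running sum of mask * hidden
    mask_sum = 0.0              # running sum of mask
    for row, m in zip(hidden, mask):
        m = float(m)            # the "cast" step of the chain
        mask_sum += m
        for j, value in enumerate(row):
            weighted_sum[j] += m * value
    denom = max(mask_sum, min_count)  # the "clamp" step guards against an all-zero mask
    return [value / denom for value in weighted_sum]


if __name__ == "__main__":
    print(masked_mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0]))
```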
Table 2: Sparkle Cases. Speedups vs. Eager and Inductor on subgraphs where Inductor underperforms Eager.

In both cases, the compiler loses high-level semantics after decomposing operations into primitives; the LLM's advantage is recognizing *composite intent* and directly lowering to fused implementations.
#### Failure Modes\.
Analysis identifies three systematic bottlenecks in agent-driven optimization. (1) Boundary Misalignment. Agents often misjudge arithmetic intensity, causing inefficient fusion of low-compute operators (e.g., ReLU) or redundant re-implementation of vendor-optimized primitives (e.g., Conv2d) in Triton. (2) Cost-Model Blindness. Lacking hardware awareness (e.g., register pressure, SRAM capacity), agents employ static tiling across varying shapes, resulting in significant gaps from roofline performance. (3) Semantic Disruption. Local rewrites frequently break optimization chains by replacing standard patterns with opaque kernels, disabling critical features like FlashAttention-2 routing. These issues indicate that PassBench poses open problems requiring hardware-aware reasoning beyond current LLM capabilities.
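Arithmetic intensity, the quantity named in failure mode (1), is the ratio of floating-point work to memory traffic. The back-of-the-envelope sketch below contrasts an elementwise operator with a matrix multiplication. It assumes FP32 data, one FLOP per element for ReLU, and ideal reuse (each tensor moved exactly once); these are textbook simplifications, not measurements from the paper.

```python
def elementwise_intensity(flops_per_element: float = 1.0, bytes_per_element: int = 4) -> float:
    """FLOPs per byte for an op that reads one element and writes one element."""
    return flops_per_element / (2 * bytes_per_element)


def matmul_intensity(m: int, n: int, k: int, bytes_per_element: int = 4) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) product, moving each operand once."""
    flops = 2 * m * n * k
    traffic = bytes_per_element * (m * k + k * n + m * n)
    return flops / traffic


if __name__ == "__main__":
    print(f"ReLU:             {elementwise_intensity():.3f} FLOP/byte")
    print(f"1024^3 matmul:    {matmul_intensity(1024, 1024, 1024):.1f} FLOP/byte")
```

Under these assumptions the two differ by roughly three orders of magnitude, which is why a fusion boundary chosen without regard to intensity can easily be the wrong one.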
## 5 Conclusion and Future Work
We present PassNet, the first large-scale ecosystem for LLM-driven compiler pass generation, comprising: (1) PassNet-Dataset, featuring 18K unique computational graphs derived from 100K real-world models; and (2) PassBench, a suite of 200 curated long-tail tasks evaluated via the Error-aware Speedup Score ($ES_t$) with layered integrity defenses against LLM exploitation. Experiments demonstrate that PassBench is highly discriminative and unsaturated: while frontier models trail TorchInductor by 37% in aggregate, individual passes achieve up to $3\times$ speedup, identifying consistency, rather than capability, as the primary bottleneck. Notably, fine-tuning on $\sim$4K PassNet trajectories yields a $2.67\times$ performance gain, approaching frontier-model levels and validating PassNet as essential infrastructure for advancing LLM-driven compiler optimization.
#### Limitations and Future Directions\.
Current experiments focus on fusible tasks, following a curriculum that progresses from simpler cases to more challenging classical subgraph tasks\. Accordingly, PassBench currently targets inference on a single GPU \(NVIDIA A30\), while generalization to training\-loop optimizations, multi\-device settings, and diverse hardware remains open\. The dataset is skewed toward CV and NLP workloads \(90\.6% combined\), which may limit coverage of emerging domains such as scientific simulation and generative models\. Our anti\-cheating defenses, while effective against observed exploits, cannot guarantee completeness against future adversarial strategies\.
Although evaluation graphs are sourced from public repositories, memorization risk is limited: pass generation requires producing executable pattern matchers and rewriters tailored to each graph structure, not reproducing code snippets; the multi\-graph formulation further requires generalization across shapes and dtypes, reducing overfitting to specific instances\.
Future directions include multi-device pass generation, extension to more complex tasks (e.g., classical subgraph tasks), integration of hardware cost models as auxiliary context, reinforcement learning from $ES_t$ feedback, and continual expansion of the dataset to underrepresented domains. The entire ecosystem is publicly available.
## References
- J\. Ansel, E\. Yang, H\. He, N\. Gimelshein, A\. Jain, M\. Voznesensky, B\. Bao, P\. Bell, D\. Berard, E\. Burovski,et al\.\(2024\)PyTorch 2: faster machine learning through dynamic python bytecode transformation and graph compilation\.InInternational Conference on Architectural Support for Programming Languages and Operating Systems \(ASPLOS\),Cited by:[§1](https://arxiv.org/html/2605.29357#S1.p1.1),[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px1.p1.1),[§3\.1](https://arxiv.org/html/2605.29357#S3.SS1.p1.1)\.
- Anthropic \(2024\)Solving SWE\-bench sonnet with Sonnet 3\.5 and Anthropics agent infrastructure\.Note:[https://www\.anthropic\.com/engineering/swe\-bench\-sonnet](https://www.anthropic.com/engineering/swe-bench-sonnet)Cited by:[§4\.1](https://arxiv.org/html/2605.29357#S4.SS1.SSS0.Px1.p1.6)\.
- C\. Baronio, P\. Marsella, B\. Pan, S\. Guo, and S\. Alberti \(2025\)Kevin: multi\-turn rl for generating cuda kernels\.arXiv preprint arXiv:2507\.11948\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px3.p1.1)\.
- T\. Chen, T\. Moreau, Z\. Jiang, L\. Zheng, E\. Yan, M\. Cowan, H\. Shen, L\. Wang, Y\. Hu, L\. Ceze, C\. Guestrin, and A\. Krishnamurthy \(2018\)TVM: an automated end\-to\-end optimizing compiler for deep learning\.InUSENIX Conference on Operating Systems Design and Implementation \(OSDI\),pp\. 579–594\.Cited by:[§1](https://arxiv.org/html/2605.29357#S1.p1.1),[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Cui, P\. Yew, S\. McCamant, and A\. Zhai \(2025\)DeCOS: data\-efficient reinforcement learning for compiler optimization selection ignited by llm\.InProceedings of the 39th ACM International Conference on Supercomputing,pp\. 943–958\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px2.p1.1)\.
- C\. Cummins, V\. Seeker, D\. Grubisic, B\. Roziere, J\. Gehring, G\. Synnaeve, and H\. Leather \(2025\)LLM compiler: foundation language models for compiler optimization\.InProceedings of the 34th ACM SIGPLAN International Conference on Compiler Construction,pp\. 141–153\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px2.p1.1)\.
- C\. Cummins, B\. Wasti, J\. Guo, B\. Cui, J\. Ansel, S\. Gomez, S\. Jain, J\. Liu, O\. Teytaud, B\. Steiner, Y\. Tian, and H\. Leather \(2022\)CompilerGym: robust, performant compiler optimization environments for ai research\.InProceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization,CGO ’22,pp\. 92–105\.External Links:ISBN 9781665405843,[Link](https://doi.org/10.1109/CGO53902.2022.9741258),[Document](https://dx.doi.org/10.1109/CGO53902.2022.9741258)Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px4.p1.1)\.
- W\. Dai, H\. Wu, Q\. Yu, H\. Gao, J\. Li, C\. Jiang, W\. Lou, Y\. Song, H\. Yu, J\. Chen,et al\.\(2026\)CUDA agent: large\-scale agentic rl for high\-performance cuda kernel generation\.arXiv preprint arXiv:2602\.24286\.Cited by:[§1](https://arxiv.org/html/2605.29357#S1.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px3.p1.1)\.
- Y\. Ding, C\. H\. Yu, B\. Zheng, Y\. Liu, Y\. Wang, and G\. Pekhimenko \(2023\)Hidet: task\-mapping programming paradigm for deep learning tensor programs\.InACM International Conference on Architectural Support for Programming Languages and Operating Systems \(ASPLOS\),pp\. 370–384\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Dong, Y\. Yang, T\. Liu, Y\. Wang, F\. Qi, V\. Tarokh, K\. Rangadurai, and S\. Yang \(2025\)STARK: strategic team of agents for refining kernels\.arXiv preprint arXiv:2510\.16996\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px3.p1.1)\.
- GLM\-5\-Team, :, A\. Zeng, X\. Lv, Z\. Hou, Z\. Du, Q\. Zheng, B\. Chen, D\. Yin, C\. Ge, C\. Huang, C\. Xie, C\. Zhu, C\. Yin, C\. Wang, G\. Pan, H\. Zeng, H\. Zhang, H\. Wang, H\. Chen, J\. Zhang, J\. Jiao, J\. Guo, J\. Wang, J\. Du, J\. Wu, K\. Wang, L\. Li, L\. Fan, L\. Zhong, M\. Liu, M\. Zhao, P\. Du, Q\. Dong, R\. Lu, Shuang\-Li, S\. Cao, S\. Liu, T\. Jiang, X\. Chen, X\. Zhang, X\. Huang, X\. Dong, Y\. Xu, Y\. Wei, Y\. An, Y\. Niu, Y\. Zhu, Y\. Wen, Y\. Cen, Y\. Bai, Z\. Qiao, Z\. Wang, Z\. Wang, Z\. Zhu, Z\. Liu, Z\. Li, B\. Wang, B\. Wen, C\. Huang, C\. Cai, C\. Yu, C\. Li, C\. Hu, C\. Zhang, D\. Zhang, D\. Lin, D\. Yang, D\. Wang, D\. Ai, E\. Zhu, F\. Yi, F\. Chen, G\. Wen, H\. Sun, H\. Zhao, H\. Hu, H\. Zhang, H\. Liu, H\. Zhang, H\. Peng, H\. Tai, H\. Zhang, H\. Liu, H\. Wang, H\. Yan, H\. Ge, H\. Liu, H\. Chu, J\. Zhao, J\. Wang, J\. Zhao, J\. Ren, J\. Wang, J\. Zhang, J\. Gui, J\. Zhao, J\. Li, J\. An, J\. Li, J\. Yuan, J\. Du, J\. Liu, J\. Zhi, J\. Duan, K\. Zhou, K\. Wei, K\. Wang, K\. Luo, L\. Zhang, L\. Sha, L\. Xu, L\. Wu, L\. Ding, L\. Chen, M\. Li, N\. Lin, P\. Ta, Q\. Zou, R\. Song, R\. Yang, S\. Tu, S\. Yang, S\. Wu, S\. Zhang, S\. Li, S\. Li, S\. Fan, W\. Qin, W\. Tian, W\. Zhang, W\. Yu, W\. Liang, X\. Kuang, X\. Cheng, X\. Li, X\. Yan, X\. Hu, X\. Ling, X\. Fan, X\. Xia, X\. Zhang, X\. Zhang, X\. Pan, X\. Zou, X\. Zhang, Y\. Liu, Y\. Wu, Y\. Li, Y\. Wang, Y\. Zhu, Y\. Tan, Y\. Zhou, Y\. Pan, Y\. Zhang, Y\. Su, Y\. Geng, Y\. Yan, Y\. Tan, Y\. Bi, Y\. Shen, Y\. Yang, Y\. Li, Y\. Liu, Y\. Wang, Y\. Li, Y\. Wu, Y\. Zhang, Y\. Duan, Y\. Zhang, Z\. Liu, Z\. Jiang, Z\. Yan, Z\. Zhang, Z\. Wei, Z\. Chen, Z\. Feng, Z\. Yao, Z\. Chai, Z\. Wang, Z\. Zhang, B\. Xu, M\. Huang, H\. Wang, J\. Li, Y\. Dong, and J\. Tang \(2026\)GLM\-5: from vibe coding to agentic engineering\.External Links:2602\.15763,[Link](https://arxiv.org/abs/2602.15763)Cited by:[§4\.1](https://arxiv.org/html/2605.29357#S4.SS1.SSS0.Px2.p1.1)\.
- A\. Grossman, L\. Paehler, K\. Parasyris, T\. Ben\-Nun, J\. Hegna, W\. S\. Moses, J\. M\. M\. Diaz, M\. Trofin, and J\. Doerfert \(2024\)ComPile: a large IR dataset from production sources\.Journal of Data\-centric Machine Learning Research\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px4.p1.1)\.
- C\. Hong, S\. Bhatia, A\. Cheung, and Y\. S\. Shao \(2025\)Autocomp: llm\-driven code optimization for tensor accelerators\.External Links:[Link](https://arxiv.org/abs/2505.18574)Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px3.p1.1)\.
- K\. Huang, J\. Zhang, X\. Xie, and C\. Chen \(2025\)Seeing is fixing: cross\-modal reasoning with multimodal llms for visual software issue repair\.In2025 40th IEEE/ACM International Conference on Automated Software Engineering \(ASE\),pp\. 1156–1168\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px2.p1.1)\.
- N\. Jain, J\. Singh, M\. Shetty, L\. Zheng, K\. Sen, and I\. Stoica \(2025\)R2e\-gym: procedural environments and hybrid verifiers for scaling open\-weights swe agents\.arXiv preprint arXiv:2504\.07164\.Cited by:[§4\.1](https://arxiv.org/html/2605.29357#S4.SS1.SSS0.Px1.p1.6)\.
- J\. Jiang, F\. Wang, J\. Shen, S\. Kim, and S\. Kim \(2024\)A survey on large language models for code generation\.ACM Transactions on Software Engineering and Methodology\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px2.p1.1)\.
- C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. Narasimhan \(2024\)SWE\-bench: can language models resolve real\-World GitHub issues?\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px2.p1.1)\.
- S\. J\. Kaufman, P\. M\. Phothilimthana, Y\. Zhou, C\. Mendis, S\. Roy, A\. Sabne, and M\. Burrows \(2021\)A learned performance model for tensor processing units\.InConference on Machine Learning and Systems \(MLSys\),Cited by:[§1](https://arxiv.org/html/2605.29357#S1.p1.1)\.
- C\. Lattner, M\. Amini, U\. Bondhugula, A\. Cohen, A\. Davis, J\. Pienaar, R\. Riddle, T\. Shpeisman, N\. Vasilache, and O\. Zinenko \(2021\)MLIR: scaling compiler infrastructure for domain specific computation\.InInternational Symposium on Code Generation and Optimization \(CGO\),pp\. 2–14\.Cited by:[§3\.1](https://arxiv.org/html/2605.29357#S3.SS1.p1.1)\.
- C\. Leary and T\. Wang \(2017\)XLA: tensorflow, compiled\.TensorFlow Dev Summit2\(3\)\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Li, Y\. Liu, X\. Liu, Q\. Sun, X\. You, H\. Yang, Z\. Luan, L\. Gan, G\. Yang, and D\. Qian \(2021\)The deep learning compiler: a comprehensive survey\.IEEE Transactions on Parallel and Distributed Systems32\(3\),pp\. 708–727\.External Links:ISSN 2161\-9883,[Document](https://dx.doi.org/10.1109/tpds.2020.3030548)Cited by:[§3\.1](https://arxiv.org/html/2605.29357#S3.SS1.p1.1)\.
- X\. Li, X\. Sun, A\. Wang, J\. Li, and C\. Shum \(2025\)Cuda\-l1: improving cuda optimization via contrastive reinforcement learning\.arXiv preprint arXiv:2507\.14111\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px3.p1.1)\.
- G\. Liao, H\. Qin, Y\. Wang, A\. Golden, M\. Kuchnik, Y\. Yetim, J\. J\. Ang, C\. Fu, Y\. He, S\. Hsia,et al\.\(2025\)Kernelevolve: scaling agentic kernel coding for heterogeneous ai accelerators at meta\.arXiv preprint arXiv:2512\.23236\.Cited by:[§1](https://arxiv.org/html/2605.29357#S1.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px3.p1.1)\.
- K\. Liu, Q\. Wang, X\. Chen, G\. Yang, Y\. Feng, G\. Liu, and J\. Liu \(2025\)Evaluating and improving framework\-based parallel code completion with large language models\.In2025 40th IEEE/ACM International Conference on Automated Software Engineering \(ASE\),pp\. 2478–2490\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px2.p1.1)\.
- P\. Mattson, V\. J\. Reddi, C\. Cheng, C\. Coleman, G\. Diamos, D\. Kanter, P\. Micikevicius, D\. Patterson, G\. Schmuelling, H\. Tang,et al\.\(2020\)MLPerf: an industry standard benchmark suite for machine learning performance\.IEEE Micro40\(2\),pp\. 8–16\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px4.p1.1)\.
- MiniMaxAI \(2026\)External Links:[Link](https://github.com/MiniMax-AI/MiniMax-M2.7)Cited by:[§4\.1](https://arxiv.org/html/2605.29357#S4.SS1.SSS0.Px2.p1.1)\.
- S\. Narang and B\. Research \(2016\)DeepBench: benchmarking deep learning operations on different hardware\.External Links:[Link](https://github.com/baidu-research/DeepBench)Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px4.p1.1)\.
- A\. Ouyang, S\. Guo, S\. Arora, A\. L\. Zhang, W\. Hu, C\. Ré, and A\. Mirhoseini \(2025\)Kernelbench: can llms write efficient gpu kernels?\.arXiv preprint arXiv:2502\.10517\.Cited by:[§1](https://arxiv.org/html/2605.29357#S1.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px3.p1.1)\.
- H\. Pan, H\. Lin, H\. Luo, Y\. Liu, K\. Yao, L\. Zhang, M\. Xing, and Y\. Wu \(2025\)Compiler\-r1: towards agentic compiler auto\-tuning with reinforcement learning\.arXiv preprint arXiv:2506\.15701\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px2.p1.1)\.
- P\. M\. Phothilimthana, S\. Abu\-El\-Haija, K\. Cao, B\. Fatemi, M\. Burrows, C\. Mendis, and B\. Perozzi \(2023\)TpuGraphs: a performance prediction dataset on large tensor computational graphs\.InConference on Neural Information Processing Systems \(NeurIPS\),Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px4.p1.1)\.
- Qwen Team \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§4\.1](https://arxiv.org/html/2605.29357#S4.SS1.SSS0.Px2.p1.1)\.
- J\. Shao, X\. Zhou, S\. Feng, B\. Hou, R\. Lai, H\. Jin, W\. Lin, M\. Masuda, C\. H\. Yu, and T\. Chen \(2022\)Tensor program optimization with probabilistic programs\.Advances in Neural Information Processing Systems35,pp\. 35783–35796\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px1.p1.1)\.
- B\. P\. Team \(2021\)CINN: compiling intermediate representation to neural networks\.Note:PaddlePaddle Compiler InfrastructureExternal Links:[Link](https://github.com/PaddlePaddle/CINN)Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Wang, V\. Joshi, S\. Majumder, X\. Chao, B\. Ding, Z\. Liu, P\. P\. Brahma, D\. Li, Z\. Liu, and E\. Barsoum \(2025\)Geak: introducing triton kernel ai agent & evaluation benchmarks\.External Links:[Link](https://arxiv.org/abs/2507.23194)Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px3.p1.1)\.
- X\. Wang, B\. Chen, Y\. Yuan, Y\. Zhang, B\. Li, H\. Qian, P\. He, R\. Lyu, Y\. Ma, Z\. Yu,et al\.\(2024\)OpenHands: an open platform for AI software developers as generalist agents\.arXiv preprint arXiv:2407\.16741\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2605.29357#S4.SS1.SSS0.Px1.p1.6)\.
- B\. Yang, H\. Tian, J\. Ren, H\. Zhang, J\. Klein, T\. F\. Bissyandé, C\. L\. Goues, and S\. Jin \(2026\)Morepair: teaching llms to repair code via multi\-objective fine\-tuning\.ACM Transactions on Software Engineering and Methodology35\(2\),pp\. 1–38\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Yang, C\. E\. Jimenez, A\. Wettig, K\. Lieret, S\. Yao, K\. Narasimhan, and O\. Press \(2024\)SWE\-agent: agent\-computer interfaces enable automated software engineering\.arXiv preprint arXiv:2405\.15793\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2605.29357#S4.SS1.SSS0.Px1.p1.6)\.
- Y\. Zhai, S\. Yang, K\. Pan, R\. Zhang, S\. Liu, C\. Liu, Z\. Ye, J\. Ji, J\. Zhao, Y\. Zhang,et al\.\(2024\)Enabling tensor language model to assist in generating high\-performance tensor programs for deep learning\.In18th USENIX Symposium on Operating Systems Design and Implementation \(OSDI 24\),pp\. 289–305\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Zhai, Y\. Zhang, S\. Liu, X\. Chu, J\. Peng, J\. Ji, and Y\. Zhang \(2023\)Tlp: a deep learning\-based cost model for tensor program tuning\.InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2,pp\. 833–845\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px2.p1.1)\.
- L\. Zheng, C\. Cheng, E\. Yan, H\. Ning, C\. H\. Yu, T\. Moreau, T\. Chen, C\. Guestrin, A\. Krishnamurthy, and L\. Ceze \(2020\)Ansor: generating high\-performance tensor programs for deep learning\.InUSENIX Symposium on Operating Systems Design and Implementation \(OSDI\),pp\. 935–950\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Zheng, R\. Liu, J\. Shao, T\. Chen, J\. E\. Gonzalez, I\. Stoica, and A\. H\. Ali \(2021\) TenSet: a large\-scale program performance dataset for learned tensor compilers\. In Conference on Neural Information Processing Systems \(NeurIPS\), Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px4.p1.1)\.
- Z\. Zheng, Z\. Pan, D\. Wang, K\. Zhu, W\. Zhao, T\. Guo, X\. Qiu, M\. Sun, J\. Bai, F\. Zhang, X\. Du, J\. Zhai, and W\. Lin \(2023a\)BladeDISC: optimizing dynamic shape machine learning workloads via compiler approach\.Proc\. ACM Manag\. Data,pp\. 206:1–206:29\.External Links:[Link](https://doi.org/10.1145/3617327)Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Zheng, K\. Ning, Y\. Wang, J\. Zhang, D\. Zheng, M\. Ye, and J\. Chen \(2023b\)A survey of large language models for code: evolution, benchmarking, and future trends\.arXiv preprint arXiv:2311\.10372\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px2.p1.1)\.
- Q\. Zhou, Y\. Wen, R\. Chen, K\. Gao, W\. Xiong, L\. Li, Q\. Guo, Y\. Wu, and Y\. Chen \(2025\)QiMeng\-gemm: automatically generating high\-performance matrix multiplication code by exploiting large language models\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 22982–22990\.Cited by:[§2](https://arxiv.org/html/2605.29357#S2.SS0.SSS0.Px3.p1.1)\.
## Appendix A Dataset Quality Constraints
A user's model, wrapped by `pass_net.extract`, is symbolically traced to generate a standardized set of files. This set forms a complete PassNet graph, including the high-level IR of the computation graph (`model.py`), metadata for inputs and weights (`input_meta.py`, `weight_meta.py`), a SHA-based graph hash for deduplication (`graph_hash.txt`), and other components such as optional custom operator code.
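Hash-based deduplication of the kind mentioned above can be sketched as follows. Everything specific here is an assumption made for illustration: the text says only that the hash is "SHA-based", so the choice of SHA-256, hashing the text of `model.py`, and the whitespace normalization are all hypothetical, as are the function names.

```python
import hashlib


def graph_hash(model_source: str) -> str:
    """Hex digest of a graph's source text (assumed: SHA-256 over normalized model.py)."""
    lines = model_source.replace("\r\n", "\n").split("\n")
    # Strip trailing whitespace so purely cosmetic differences do not change the hash.
    normalized = "\n".join(line.rstrip() for line in lines).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(sources: dict) -> dict:
    """Map each distinct hash to the first graph name that produced it."""
    unique = {}
    for name, source in sources.items():
        unique.setdefault(graph_hash(source), name)
    return unique


if __name__ == "__main__":
    graphs = {"m1": "x = 1\n", "m2": "x = 1  \n", "m3": "x = 2\n"}
    print(sorted(deduplicate(graphs).values()))
```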
We define five constraints applied to every computational graph in PassNet to ensure dataset quality and cross\-platform compatibility:
- Runnable: Each graph must execute forward propagation under the designated framework without syntax errors or crashes.
- Serializable: Each sample and its metadata must be serializable into standard formats (e.g., JSON) and correctly deserializable.
- Decomposable: The entire computational graph must be decomposable into multiple non-overlapping subgraphs, where each subgraph represents an independent optimization unit. This supports compiler backends in performing fusion, scheduling, and other optimization tasks.
- Statically Analyzable: Operator names, types, and dependencies must be statically extractable (e.g., via `torch.fx`), so that automated analysis tools can fully interpret operator semantics for structural traversal and pattern matching.
- Custom Operator Accessible: If a sample includes user-defined custom operators, the corresponding source code for these operators must be traceable and accessible in a modular form, ensuring reusability and integration across compiler environments.
## Appendix B PassBench Sampling Details
#### Multi\-dimensional Bucketing\.
Subgraphs are grouped into discrete buckets defined by three complementary dimensions:
- Operator Sequence: The ordered operator-name list serves as an exact-match key.
- Input Shape: We apply logarithmic quantization $\lfloor \log_2(d)/4 \rfloor$ to each dimension $d$, where the scaling factor 4 is empirically chosen to balance granularity and generalization.
- Input Dtype: Exact-match key for numerical precision.
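The three dimensions above combine naturally into a single hashable bucket key. The sketch below is one possible encoding, not the PassBench implementation: the function names and the tuple layout are assumptions, and dimensions are assumed to be positive integers (dynamic or zero-sized dimensions are not handled).

```python
def quantize_dim(d: int) -> int:
    """floor(log2(d) / 4) for a positive integer dimension d.

    Uses integer arithmetic: floor(log2(d)) == d.bit_length() - 1, and flooring
    before or after dividing by the integer 4 gives the same result.
    """
    if d < 1:
        raise ValueError("dimension must be a positive integer")
    return (d.bit_length() - 1) // 4


def bucket_key(op_names, input_shapes, input_dtypes):
    """Bucket identifier: exact op sequence, quantized shapes, exact dtypes."""
    quantized = tuple(tuple(quantize_dim(d) for d in shape) for shape in input_shapes)
    return (tuple(op_names), quantized, tuple(input_dtypes))


if __name__ == "__main__":
    a = bucket_key(["mul", "sum"], [(32, 128)], ["float16"])
    b = bucket_key(["mul", "sum"], [(64, 200)], ["float16"])
    c = bucket_key(["mul", "sum"], [(32, 4096)], ["float16"])
    print(a == b, a == c)
```

With a divisor of 4, each shape bucket spans a 16x range of sizes (1–15, 16–255, 256–4095, and so on), which is the granularity trade-off the text refers to.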
#### Hierarchical Representative Grouping\.
Following bucketing, we employ a hierarchical strategy to select representative subgraphs and construct groups with varying cardinalities, each corresponding to a PassBench sample\. The group size controls the optimization difficulty, as larger and more heterogeneous groups require transformations that generalize across a broader set of subgraphs\.
- Intra-bucket Stratified Sampling: Within each operator sequence, subgraphs are sampled at a fixed stride $\sigma$ and then organized into groups of size 1 and 3, capturing both single-operator cases and short compositional patterns.
- Cross-shape Structural Aggregation: For each unique operator sequence, one representative is aggregated with subgraphs sharing the same sequence across different input shape buckets.
- Precision-aware Coverage: Within each shape bucket, subgraphs with distinct data types (FP32, FP16, BF16) are aggregated to ensure numerical format coverage.
## Appendix C Experimental Setup
### C.1 Benchmark Evaluation
Table [3](https://arxiv.org/html/2605.29357#A3.T3) summarizes the hardware and evaluation protocol used across all PassBench experiments.
Table 3: Evaluation environment and benchmark protocol.
### C.2 Distillation and Post-training
Table [4](https://arxiv.org/html/2605.29357#A3.T4) details the teacher-student configuration and SFT hyperparameters.
Table 4: Distillation and post-training configuration.
## Appendix D Graph-level Interpretation of $ES_t$
In this section, we show that $ES_t$ can be equivalently expressed as a geometric mean of per-graph error-aware rectified speedups.
We first define $r_{t,i}$, the penalty factor for erroneous graphs that captures tolerance-dependent penalties. The per-graph error-aware rectified speedup $\hat{s}_{t,i}$ (defined in Section [3.5](https://arxiv.org/html/2605.29357#S3.SS5)) uses $r_{t,i}$ in its third case.
###### Definition D.1 (Penalty Factor).
For each erroneous graph $i$, let $c_i\in\{1,2,3\}$ denote its error category. Let $N_{\text{err}}$ be the number of erroneous graphs, and $N_c$ be the number of graphs with error category $c$. The *penalty factor* is

$$r_{t,i}=\begin{cases}b, & t<c_i,\\[3pt] 1, & \text{otherwise},\end{cases}\tag{5}$$

where $b\in(0,1)$ is the base penalty.
Let $\pi_c=N_c/N_{\text{err}}$ denote the fraction of error category $c$ among all erroneous graphs.
###### Definition D.2 (Error-aware Speedup Score $ES_t$).
The macro-level Error-aware Speedup Score $ES_t$ admits the following factored form:

$$ES_t=\alpha^{\lambda}\cdot\beta^{\lambda\eta(p+1)}\cdot\gamma_t^{1-\lambda}\tag{6}$$

where $\gamma_t=b^{\sum_{c\in\{1,2,3\}}\pi_c\,\mathbf{1}(t<c)}$. Here $\alpha$ aggregates correct-and-fast subgraphs, $\beta$ penalizes correct-but-slow ones, and $\gamma_t$ accounts for errors under tolerance $t$.
###### Proposition D.1 (Geometric mean of penalty factors).
The aggregated penalty $\gamma_t$ equals the geometric mean of $\{r_{t,i}\}$ over all erroneous graphs:

$$\gamma_t=\left(\prod_{i=1}^{N_{\text{err}}}r_{t,i}\right)^{1/N_{\text{err}}}\tag{7}$$

From the definition of $r_{t,i}$, each term contributes $b$ if $t<c_i$:

$$\prod_{i=1}^{N_{\text{err}}}r_{t,i}=\prod_{i=1}^{N_{\text{err}}}b^{\mathbf{1}(t<c_i)}=b^{\sum_{i=1}^{N_{\text{err}}}\mathbf{1}(t<c_i)}$$

Grouping by error category $c$ gives

$$\sum_{i=1}^{N_{\text{err}}}\mathbf{1}(t<c_i)=\sum_{c\in\{1,2,3\}}N_c\,\mathbf{1}(t<c)$$

thus

$$\prod_{i=1}^{N_{\text{err}}}r_{t,i}=b^{\sum_{c\in\{1,2,3\}}N_c\,\mathbf{1}(t<c)}$$

Taking the $N_{\text{err}}$-th root:

$$\left(\prod_{i=1}^{N_{\text{err}}}r_{t,i}\right)^{1/N_{\text{err}}}=b^{\frac{1}{N_{\text{err}}}\sum_{c\in\{1,2,3\}}N_c\,\mathbf{1}(t<c)}=b^{\sum_{c\in\{1,2,3\}}\pi_c\,\mathbf{1}(t<c)}$$

which is exactly the definition of $\gamma_t$.
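Proposition D.1 is easy to sanity-check numerically. The sketch below computes the penalty both ways, as a geometric mean over individual erroneous graphs and via the category fractions, and confirms they agree. The base penalty `b = 0.1` and the example category list are arbitrary illustrative values, not settings from the paper.

```python
import math
from collections import Counter


def penalty_factor(t: int, category: int, b: float) -> float:
    """r_{t,i}: b when the tolerance index t is below the graph's error category, else 1."""
    return b if t < category else 1.0


def gamma_from_graphs(t: int, categories, b: float) -> float:
    """Geometric mean of r_{t,i} over all erroneous graphs."""
    if not categories:
        raise ValueError("at least one erroneous graph is required")
    factors = [penalty_factor(t, c, b) for c in categories]
    return math.prod(factors) ** (1.0 / len(factors))


def gamma_from_fractions(t: int, categories, b: float) -> float:
    """Closed form b ** sum_c pi_c * 1(t < c), with pi_c = N_c / N_err."""
    if not categories:
        raise ValueError("at least one erroneous graph is required")
    n_err = len(categories)
    exponent = sum(n_c / n_err for c, n_c in Counter(categories).items() if t < c)
    return b ** exponent


if __name__ == "__main__":
    cats = [1, 1, 2, 3, 3, 3]
    for t in range(-2, 5):
        print(t, gamma_from_graphs(t, cats, 0.1), gamma_from_fractions(t, cats, 0.1))
```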
###### Proposition D.2 (Geometric mean of per-graph speedups).
Since $\gamma_t$ is the geometric mean of $\{r_{t,i}\}$ for a given $t$, the macro-level metric $ES_t$ equals the geometric mean of per-graph error-aware rectified speedups:

$$ES_t=\left(\prod_{i=1}^{N}\hat{s}_{t,i}\right)^{1/N}\tag{8}$$

This follows from the factored form: $\alpha$, $\beta$, and $\gamma_t$ are respectively the geometric means of the speedups from the correct-fast, correct-slow, and erroneous cases in $\hat{s}_{t,i}$. Their product therefore equals the geometric mean of all $\{\hat{s}_{t,i}\}$.
## Appendix E Aggregated Speedup (AS) Weight Specification
The Aggregated Speedup aggregates the tolerance-parameterized $ES_t$ into a single scalar via a normalized geometric mean (Section [3.5](https://arxiv.org/html/2605.29357#S3.SS5)). Given the weight function $W_t$, AS is defined as:

$$\mathrm{AS}=\prod_{t=-10}^{|E|+1}ES_t^{\,W_t/\sum_{s=-10}^{|E|+1}W_s}\tag{9}$$

The weight function $W_t$ is defined as:

$$W_t=\begin{cases}0.001 & \text{if } -10\leq t\leq -6 \text{ or } t\geq |E|+1,\\ 1 & \text{if } -5\leq t\leq -3,\\ 0.8^{t+3} & \text{if } -2\leq t\leq |E|.\end{cases}\tag{10}$$
#### Design Rationale\.
The weight schedule reflects three regimes:
- Strict-correctness regime ($t\in[-5,-3]$, $W_t=1$): Production-level accuracy; full weight.
- Relaxed regime ($t\in[-2,|E|]$, $W_t=0.8^{t+3}$): Exponential decay as correctness becomes easier to satisfy.
- Extreme regime ($t\leq-6$ or $t\geq|E|+1$, $W_t=0.001$): Near-zero weight to avoid distortion from ultra-strict or ultra-relaxed tolerances.
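The weight schedule and the aggregation in Eqs. (9) and (10) translate directly into code. The sketch below is a plain transcription, not the PassBench implementation: `e_size` stands in for $|E|$ (whose value is defined in the main text and is passed here as an ordinary integer), and the input is assumed to be a dictionary mapping every tolerance index $t$ to its $ES_t$.

```python
import math


def weight(t: int, e_size: int) -> float:
    """W_t from Eq. (10), for t in [-10, e_size + 1]."""
    if t < -10 or t > e_size + 1:
        raise ValueError("t is outside the range covered by the AS definition")
    if t <= -6 or t == e_size + 1:
        return 0.001           # extreme regime
    if t <= -3:
        return 1.0             # strict-correctness regime
    return 0.8 ** (t + 3)      # relaxed regime


def aggregated_speedup(es_by_t: dict, e_size: int) -> float:
    """Weighted geometric mean of ES_t with normalized weights, as in Eq. (9)."""
    ts = range(-10, e_size + 2)
    weights = {t: weight(t, e_size) for t in ts}
    if any(es_by_t[t] <= 0.0 for t in ts):
        return 0.0             # a zero factor with positive weight zeroes the product
    total = sum(weights.values())
    log_as = sum(weights[t] * math.log(es_by_t[t]) for t in ts) / total
    return math.exp(log_as)


if __name__ == "__main__":
    print([round(weight(t, 3), 3) for t in range(-10, 5)])
    print(aggregated_speedup({t: 0.5 for t in range(-10, 5)}, 3))
```

Because the weights are normalized, a model whose $ES_t$ is constant across tolerances receives exactly that value as its AS.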
## Appendix F Configuration of atol(t) and rtol(t)
We perform log-linear interpolation between reference points (e.g., $\mathrm{atol}_{\mathrm{fp32}}(-5)=10^{-5}$ and $\mathrm{atol}_{\mathrm{fp32}}(0)=1$) such that $\mathrm{atol}(t),\mathrm{rtol}(t)=10^{kt}$.
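Log-linear interpolation means the base-10 logarithm of the tolerance is a linear function of $t$. The sketch below implements that for two reference points and defaults to the FP32 atol example given above, for which the slope works out to $k=1$. The function name and signature are illustrative; reference points for other dtypes come from Tables 5 and 6 and are not reproduced here.

```python
import math


def log_linear_tolerance(t: float, ref_lo=(-5, 1e-5), ref_hi=(0, 1.0)) -> float:
    """Tolerance at index t, interpolated linearly in log10 space between two reference points.

    Each reference point is a pair (t_ref, tolerance_at_t_ref).
    """
    (t0, v0), (t1, v1) = ref_lo, ref_hi
    if t0 == t1:
        raise ValueError("reference points must have distinct t values")
    slope = (math.log10(v1) - math.log10(v0)) / (t1 - t0)  # the k in 10**(k*t)
    return 10.0 ** (math.log10(v0) + slope * (t - t0))


if __name__ == "__main__":
    for t in (-5, -3, -2, 0):
        print(t, log_linear_tolerance(t))
```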
Table 5: atol configuration (abbreviated)
Table 6: rtol configuration (abbreviated)
## Appendix G TorchInductor Default Pipeline Profiling
To empirically quantify the long-tail optimization gap discussed in Section [1](https://arxiv.org/html/2605.29357#S1), we profile TorchInductor's default pass pipeline (`torch.compile(mode="default")`) on a large-scale subgraph corpus.
#### Setup\.
We extract 9,526 subgraphs from 1,000 community models via torch\.fx graph tracing: 56\.9% from HuggingFace Transformers \(BERT, T5, LLaVA, Qwen\-VL, etc\.\) and 43\.1% from timm \(ResNet, EfficientNet, ViT, FastViT, etc\.\)\. Subgraph node counts range from 4 to 63 \(median 16\)\. Experiments run on NVIDIA A30 GPUs \(24 GiB\) with PyTorch 2\.9\.1 and CUDA 12\.8\. Each subgraph is benchmarked with 20 warmup iterations and 100 timed trials; we report median latencies and reject samples with unstable timing \(IQR/median $>20\%$\)\.
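The measurement protocol above \(warmup, timed trials, median, IQR\-based rejection\) might look roughly like the following Python sketch\. The function names are hypothetical, and the quartile method is an assumption since the text does not specify one\. The sketch times a host\-side callable only; GPU workloads would additionally need device synchronization around each trial\.

```python
import statistics
import time


def benchmark(fn, warmup=20, trials=100):
    """Run fn with warmup iterations, then return per-trial latencies in ms."""
    for _ in range(warmup):
        fn()
    latencies_ms = []
    for _ in range(trials):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1e3)
    return latencies_ms


def is_stable(latencies_ms, threshold=0.20):
    """Accept a sample only if IQR / median does not exceed the threshold."""
    q1, _, q3 = statistics.quantiles(latencies_ms, n=4)  # assumed quartile method
    return (q3 - q1) / statistics.median(latencies_ms) <= threshold


samples = benchmark(lambda: sum(range(1000)))
print(statistics.median(samples), is_stable(samples))
```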
#### Compilation Success\.
Of the 9,526 subgraphs, 84\.5% compile successfully and pass correctness verification\. The majority of failures \(72\.8%\) stem from GPU timing instability in the shared cloud environment \(IQR/median exceeding our 20% stability threshold\); these samples are excluded as invalid measurements\. The compiler\-specific failure rate is 3\.1%, concentrated on non\-standard architectures \(LeViT, GCViT\) with unconventional reshape and attention\-head operations\.
#### Kernel Speedup Distribution\.
Table [7](https://arxiv.org/html/2605.29357#A7.T7) reports the kernel\-level speedup distribution over the 8,021 subgraphs with valid performance data\.
Table 7: Kernel speedup distribution under TorchInductor’s default pipeline\. Over one\-third of subgraphs receive less than $1.2\times$ kernel acceleration\.
#### End\-to\-End Slowdown Analysis\.
Of the 43% of subgraphs that exhibit E2E slowdowns, the majority are *not* caused by kernel\-level regression\. Table [8](https://arxiv.org/html/2605.29357#A7.T8) decomposes the 3,479 E2E bad cases by root cause\.
Table 8: Root\-cause decomposition of E2E slowdowns\. In 80\.2% of cases, kernels are actually faster \(median $1.21\times$\), but a fixed dispatch overhead of $\sim 0.14$ ms from torch\.compile negates the gain\.

Subgraph size is the primary determinant: the E2E bad rate drops from 85\.8% for subgraphs with 1–5 nodes to 13\.0% for those with 30–50 nodes \(median subgraph size: 9 nodes for bad cases vs\. 29 for good cases\)\. This confirms that the E2E overhead is largely a framework infrastructure cost orthogonal to pass\-level optimization quality\. True kernel degradation accounts for only 18\.3% of E2E bad cases and is concentrated on FastViT\-family architectures with non\-standard attention mechanisms\.
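To see why a fixed overhead dominates on small subgraphs, consider a deliberately simple additive model: compiled latency equals eager latency divided by the kernel speedup, plus a constant dispatch overhead\. This model and the function names below are assumptions made for illustration, not the decomposition used to build Table 8; only the $1.21\times$ median speedup and the $\sim 0.14$ ms overhead come from the text\.

```python
def e2e_speedup(eager_ms, kernel_speedup, overhead_ms=0.14):
    """End-to-end speedup under an assumed additive-overhead model."""
    compiled_ms = eager_ms / kernel_speedup + overhead_ms
    return eager_ms / compiled_ms


def breakeven_eager_ms(kernel_speedup, overhead_ms=0.14):
    """Eager latency at which the kernel gain exactly offsets the overhead."""
    if kernel_speedup <= 1.0:
        raise ValueError("no break-even point without a kernel speedup above 1")
    return overhead_ms / (1.0 - 1.0 / kernel_speedup)


# With the median 1.21x kernel speedup, break-even is roughly 0.81 ms of eager time.
print(breakeven_eager_ms(1.21))
print(e2e_speedup(0.5, 1.21), e2e_speedup(2.0, 1.21))
```

Under this model, any subgraph whose eager latency falls below the break\-even point shows an E2E slowdown even though its kernels are faster, which is consistent with the observation that small subgraphs account for most bad cases\.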
#### Numerical Precision\.
Among the 8,027 correctly compiled subgraphs, 39\.7% produce bitwise\-identical outputs and 81\.5% exhibit maximum absolute difference below $10^{-6}$\. A small fraction \(4\.6%\) shows deviations $\geq 10^{-3}$, typically correlated with aggressive fusion strategies that yield higher speedups \(mean kernel speedup $3.03\times$ vs\. $2.09\times$ for precise cases\)\.
#### $ES_t$ Scores\.
Evaluating TorchInductor under our $ES_t$ metric yields a G\-Mean Speedup of 0\.846 and an AS Score of 0\.706 \(Table [1](https://arxiv.org/html/2605.29357#S4.T1)\), confirming substantial room for improvement when correctness and stability are jointly considered\.
## Appendix H Detailed Sparkle Case Analysis
This appendix provides the complete pass files for the sparkle cases discussed in Section [4](https://arxiv.org/html/2605.29357#S4)\.
### H\.1 Case 1: Index\-Arithmetic Roll\+Slice Fusion \(MaskFormer\)
The MaskFormer pass targets an 8\-operator subgraph in MaskFormer’s pixel decoder\. Its operator sequence is contiguous $\to$ view $\to$ roll $\to$ slice $\to$ contiguous $\to$ view $\to$ add $\to$ layer\_norm; under default compilation, this launches 6 separate kernels\. The LLM recognizes that roll\(shift=3\) \+ slice\[:128\] on a $[S,S,C]$ tensor can be replaced by direct index arithmetic $\text{input\_i}=(S+i-\text{shift})\bmod S$, and additionally fuses the layer\-norm reduction via shared\-memory accumulation\. Crucially, the replacement produces two outputs \(add result and layer\-norm result\) in a single kernel\.
The pass is split into a shared lowering module and a shape\-specific pattern pass\. The shared module contains the CUDA kernel, the inline loader registration, and a routing dispatch that parameterizes the kernel for three Swin\-Transformer stages \(D=96, 192, 384\)\. All shape\-specific passes import the same shared module and differ only in their pattern and replacement\_args, demonstrating the reusability of a single lowering across multiple FX pattern instances\.
Shared lowering module \(shared\_fused\_roll\_add\_ln\.py with inline CUDA\):
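The index\-arithmetic identity can be checked on a small example without any GPU\. The pure\-Python sketch below uses hypothetical helper names and a toy $7\times 7$ grid in place of the real $[S,S,C]$ tensor; it compares materializing the roll and then slicing against reading each output element directly from its source position\.

```python
def roll_then_slice(grid, shift, sc):
    """Materialize a roll by `shift` along both dims, then keep the [:sc, :sc] corner."""
    rolled_rows = grid[-shift:] + grid[:-shift]
    rolled = [row[-shift:] + row[:-shift] for row in rolled_rows]
    return [row[:sc] for row in rolled[:sc]]


def direct_index(grid, shift, sc):
    """Read each output element straight from its source, with no intermediate tensor."""
    s = len(grid)
    out = []
    for row in range(sc * sc):              # flattened output row index
        i, j = divmod(row, sc)
        out.append(grid[(s + i - shift) % s][(s + j - shift) % s])
    return out


S = 7
grid = [[i * S + j for j in range(S)] for i in range(S)]
flat = [v for r in roll_then_slice(grid, 3, 5) for v in r]
print(flat == direct_index(grid, 3, 5))
```

Because every output position maps to exactly one input position, the roll, slice, and the two contiguous copies never need to be materialized\.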
```
import torch
from torch.utils.cpp_extension import load_inline

_cuda_src = r"""
#include <torch/extension.h>
#include <cuda_runtime.h>

template<typename scalar_t>
__global__ void fused_roll_slice_add_layernorm_kernel(
    const scalar_t* __restrict__ input_6d,
    const scalar_t* __restrict__ residual,
    const scalar_t* __restrict__ weight,
    const scalar_t* __restrict__ bias,
    scalar_t* __restrict__ out_add,
    scalar_t* __restrict__ out_ln,
    const int R, const int C, const int S, const int SC, const int shift
) {
    const int row = blockIdx.x;
    if (row >= R) return;
    extern __shared__ float smem[];
    const int tid = threadIdx.x;

    // Index arithmetic: roll(shift,dims=[0,1]) on [S,S,C] then slice [:SC,:SC,:]
    const int i = row / SC;
    const int j = row % SC;
    const int input_i = (S + i - shift) % S;
    const int input_j = (S + j - shift) % S;

    // Pass 1: accumulate mean of (residual + rolled_input) via shared memory
    float local_sum = 0.0f;
    for (int col = tid; col < C; col += blockDim.x) {
        float rolled = static_cast<float>(
            input_6d[input_i * S * C + input_j * C + col]);
        float res = static_cast<float>(residual[row * C + col]);
        local_sum += res + rolled;
    }
    smem[tid] = local_sum;
    // shared-memory reduction to mean ...
    const float mean = smem[0] / static_cast<float>(C);

    // Pass 2: accumulate variance
    float local_var = 0.0f;
    for (int col = tid; col < C; col += blockDim.x) {
        float rolled = static_cast<float>(
            input_6d[input_i * S * C + input_j * C + col]);
        float res = static_cast<float>(residual[row * C + col]);
        float diff = (res + rolled) - mean;
        local_var += diff * diff;
    }
    smem[tid] = local_var;
    // shared-memory reduction to variance ...
    const float inv_std = rsqrtf(smem[0] / static_cast<float>(C) + 1e-5f);

    // Pass 3: write both outputs
    for (int col = tid; col < C; col += blockDim.x) {
        float rolled = static_cast<float>(
            input_6d[input_i * S * C + input_j * C + col]);
        float res = static_cast<float>(residual[row * C + col]);
        float added = res + rolled;
        out_add[row * C + col] = static_cast<scalar_t>(added);
        float normed = (added - mean) * inv_std;
        out_ln[row * C + col] = static_cast<scalar_t>(
            normed * static_cast<float>(weight[col])
            + static_cast<float>(bias[col]));
    }
}

// Host wrapper omitted for brevity; allocates out_add/out_ln,
// chooses threads via next-power-of-two heuristics, and launches.
"""

_ext = load_inline(
    name="fused_roll_slice_add_layernorm_ext",
    cuda_sources=_cuda_src,
    functions=["fused_roll_slice_add_layernorm_cuda"],
    verbose=False,
)


@torch.fx.wrap
def fused_roll_slice_add_layernorm_dispatch(in_3, in_2, in_1, in_0, route):
    if route == "D96":
        return _ext.fused_roll_slice_add_layernorm_cuda(
            in_3, in_2, in_1, in_0, 133, 128, 3, 1e-5)
    elif route == "D192":
        return _ext.fused_roll_slice_add_layernorm_cuda(
            in_3, in_2, in_1, in_0, 70, 64, 3, 1e-5)
    elif route == "D384":
        return _ext.fused_roll_slice_add_layernorm_cuda(
            in_3, in_2, in_1, in_0, 35, 32, 3, 1e-5)
    else:
        raise ValueError(f"Unknown route: {route}")
```
Shape\-specific pattern pass \(FusedRollSliceAddLayerNorm\_96\.py, D=96 stage\):
```
import torch
from pass_dir.shared_fused_roll_add_ln import fused_roll_slice_add_layernorm_dispatch


def pattern(in_0, in_1, in_2, in_3):
    tmp_2 = in_3.contiguous()
    tmp_3 = tmp_2.view(-1, 133, 133, 96)
    tmp_4 = torch.roll(tmp_3, shifts=(3, 3), dims=(1, 2))
    tmp_5 = tmp_4[(slice(None, None, None), slice(None, 128, None),
                   slice(None, 128, None), slice(None, None, None))]
    tmp_6 = tmp_5.contiguous()
    tmp_7 = tmp_6.view(1, 16384, 96)
    tmp_8 = in_2 + tmp_7
    tmp_9 = torch.nn.functional.layer_norm(tmp_8, (96,), in_1, in_0, 1e-05)
    return (tmp_8, tmp_9)


def replacement_args(in_0, in_1, in_2, in_3):
    return (in_3, in_2, in_1, in_0, "D96")


def replacement_func():
    return fused_roll_slice_add_layernorm_dispatch
```
### H\.2 Case 2: Pattern\-Driven Masked Mean Pooling \(BGE\-Reranker\)
The BGE\-Reranker pass targets a 7\-operator subgraph in the sentence\-embedding module\. Its operator sequence is cast $\to$ mul $\to$ sum $\to$ sum $\to$ clamp $\to$ div $\to$ cat; under default compilation, this launches 7 separate kernels because the compiler treats the two reductions independently\. The LLM recognizes the semantic intent as “masked mean pooling” and generates a rewrite that fuses all operators into a single kernel, accumulating $\sum(\text{mask}\cdot\text{hidden})$ and $\sum(\text{mask})$ simultaneously in float32 registers\.
Shared lowering module \(shared\_cuda\_bge\.py with inline CUDA\):
```
import torch
from torch.utils.cpp_extension import load_inline

_cuda_src = r"""
#include <torch/extension.h>
#include <cuda_runtime.h>

template<typename hidden_t>
__global__ void masked_mean_pooling_kernel(
    const float* __restrict__ mask,
    const hidden_t* __restrict__ hidden,
    float* __restrict__ output,
    const int S, const int D
) {
    const int bid = blockIdx.x;
    const int d_idx = blockIdx.y * blockDim.x + threadIdx.x;
    if (d_idx >= D) return;

    const float* mask_batch = mask + bid * S * D;
    const hidden_t* hidden_batch = hidden + bid * S * D;

    float acc_val = 0.0f;
    float acc_mask = 0.0f;
    for (int s = 0; s < S; ++s) {
        int offset = s * D + d_idx;
        acc_val += static_cast<float>(hidden_batch[offset]) * mask_batch[offset];
        acc_mask += mask_batch[offset];
    }
    output[bid * D + d_idx] = acc_val / (acc_mask > 1e-9f ? acc_mask : 1e-9f);
}

// Host wrapper omitted for brevity; converts mask to float32,
// allocates [B,D] float32 output, and launches with dim3(B, ceil(D/128)).
"""

_ext = load_inline(
    name="fused_masked_mean_pooling_cuda_ext",
    cuda_sources=_cuda_src,
    functions=["fused_masked_mean_pooling_cuda"],
    verbose=False,
)


@torch.fx.wrap
def fused_masked_mean_pooling(in_0, in_1):
    return _ext.fused_masked_mean_pooling_cuda(in_0, in_1)
```
Pattern pass \(FuseMaskedMeanPooling\.py\):
```
import torch
from pass_dir.shared_cuda_bge import fused_masked_mean_pooling


def pattern(in_0, in_1):
    tmp_0 = in_0.to(torch.float32)
    tmp_1 = in_1 * tmp_0
    tmp_2 = torch.sum(tmp_1, 1)
    tmp_3 = tmp_0.sum(1)
    tmp_4 = torch.clamp(tmp_3, min=1e-09)
    tmp_5 = tmp_2 / tmp_4
    tmp_6 = torch.cat([tmp_5], 1)
    return tmp_6


def replacement_args(in_0, in_1):
    return (in_0, in_1)


def replacement_func():
    return fused_masked_mean_pooling
```
Key implementation notes\. The kernel accumulates two reductions \(acc\_val and acc\_mask\) in float32 within a single sequential loop over the sequence dimension $S$, matching the eager\-mode cast\-to\-float32 semantics\. The output shape is $[B,D]$, where $B$ is the batch size and $D$ is the hidden dimension\. Because the eager pattern includes a no\-op cat \(concatenating a single tensor along dim 1\), the replacement preserves the original output shape without additional reshaping\. The same rewrite rule applies to varying sequence lengths and hidden dimensions without recompilation\.
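The equivalence between the staged eager pattern and the fused accumulation can be illustrated with a small pure\-Python reference for a single batch element\. Both function names are hypothetical, and nested lists of shape $[S][D]$ stand in for tensors; this is a sketch of the arithmetic only, not of the CUDA launch configuration\.

```python
def masked_mean_staged(mask, hidden, eps=1e-9):
    """Mirror the eager pattern: mul, sum, sum, clamp, div as separate stages."""
    seq_len, dim = len(hidden), len(hidden[0])
    prod = [[hidden[s][d] * mask[s][d] for d in range(dim)] for s in range(seq_len)]
    num = [sum(prod[s][d] for s in range(seq_len)) for d in range(dim)]
    den = [sum(mask[s][d] for s in range(seq_len)) for d in range(dim)]
    den = [max(v, eps) for v in den]                     # clamp(min=eps)
    return [n / v for n, v in zip(num, den)]


def masked_mean_fused(mask, hidden, eps=1e-9):
    """Mirror the kernel: one loop over S with two accumulators per column."""
    seq_len, dim = len(hidden), len(hidden[0])
    out = []
    for d in range(dim):
        acc_val, acc_mask = 0.0, 0.0
        for s in range(seq_len):
            acc_val += hidden[s][d] * mask[s][d]
            acc_mask += mask[s][d]
        out.append(acc_val / (acc_mask if acc_mask > eps else eps))
    return out


mask = [[1.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
hidden = [[2.0, 4.0], [100.0, 100.0], [4.0, 8.0]]
print(masked_mean_staged(mask, hidden), masked_mean_fused(mask, hidden))
```

The masked\-out middle row contributes to neither accumulator, and an all\-zero mask falls back to the clamp value in the denominator rather than dividing by zero\.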