KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels
Summary
KernelBench-X is a new benchmark for evaluating LLM-generated GPU kernels, revealing that task structure impacts correctness more than method design and that correctness does not guarantee hardware efficiency.
Source: https://huggingface.co/papers/2605.04956
Abstract
KernelBench-X benchmark reveals that task structure significantly impacts LLM-generated Triton kernel correctness more than method design, while iterative refinement improves correctness at the expense of performance, and correctness does not guarantee efficiency.
LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability break down, and why? We present KernelBench-X, a benchmark designed to answer this question through category-aware evaluation of correctness and hardware efficiency across 176 tasks in 15 categories. Our systematic comparison of five representative methods yields three main findings. First, task structure determines correctness more than method design. Category explains nearly three times more variance in semantic correctness than method (9.4% vs 3.3% explained deviance), and 72% of Fusion tasks fail across all five methods while Math tasks are solved consistently. Second, iterative refinement improves correctness, but not performance. Across GEAK iterations, compile rate rises from 52.3% to 68.8% while average speedup declines from 1.58× to 1.44×; newly rescued kernels consistently underperform persistently correct ones (1.16× vs 1.58× speedup in round 0→1). Third, correctness does not imply efficiency. 46.6% of correct kernels are slower than the PyTorch eager baseline, and cross-hardware speedup variance reaches 21.4×. Besides, quantization remains completely unsolved (0/30 successes) despite non-trivial compilation rates, revealing systematic misunderstanding of numerical computation contracts rather than surface-level syntax errors. These findings suggest that future progress depends on handling global coordination, explicitly modeling numerical precision, and incorporating hardware efficiency into generation. The code is available at https://github.com/BonnieW05/KernelBenchX
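The headline metrics in the abstract (compile rate, average speedup over the PyTorch eager baseline, and the share of correct kernels that are slower than that baseline) can be illustrated with a small sketch. This is not the KernelBench-X implementation: the `KernelResult` record, the `summarize` helper, the task names, and the timings are all hypothetical, and the arithmetic mean is an assumption, since the abstract does not say how speedups are averaged.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class KernelResult:
    """One generated kernel on one task (hypothetical record layout)."""
    task: str
    compiled: bool
    correct: bool
    baseline_ms: Optional[float] = None  # PyTorch eager latency, if measured
    kernel_ms: Optional[float] = None    # generated kernel latency, if measured


def summarize(results: List[KernelResult]) -> Dict[str, Optional[float]]:
    """Aggregate compile rate, correctness rate, and speedup statistics."""
    total = len(results)
    correct = [r for r in results if r.correct]
    # Speedup > 1 means the generated kernel beats the eager baseline.
    speedups = [r.baseline_ms / r.kernel_ms for r in correct]
    return {
        "compile_rate": sum(r.compiled for r in results) / total,
        "correct_rate": len(correct) / total,
        # Assumption: arithmetic mean over correct kernels only.
        "mean_speedup": mean(speedups) if speedups else None,
        "frac_slower_than_baseline": (
            sum(s < 1.0 for s in speedups) / len(speedups) if speedups else None
        ),
    }


# Made-up data: two correct kernels (one faster, one slower than eager),
# one that compiles but is wrong, and one that fails to compile.
results = [
    KernelResult("softmax", True, True, baseline_ms=2.0, kernel_ms=1.0),
    KernelResult("layernorm", True, True, baseline_ms=1.0, kernel_ms=2.0),
    KernelResult("fused_attention", True, False),
    KernelResult("int8_gemm", False, False),
]
summary = summarize(results)
print(summary)
```

In this toy data the mean speedup is above 1 even though half of the correct kernels are slower than the baseline, which is the kind of gap between correctness and efficiency the third finding describes.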
Similar Articles
AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents
AgentKernelArena is an open-source benchmark for evaluating AI coding agents on GPU kernel optimization, assessing full agent workflows and generalization to unseen configurations across 196 tasks.
Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization
KernelPro is a closed-loop multi-agent system that uses LLMs and micro-profiling tools to automatically optimize GPU kernel code, achieving geomean speedups of 2.42×/4.69×/5.30× on KernelBench and demonstrating a measured 11.6% energy reduction at matched speed.
LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
Introduces LinAlg-Bench, a diagnostic benchmark evaluating 10 frontier LLMs on structured linear algebra computation across matrix dimensions, revealing that LLM mathematical failure is structurally constrained and transitions from execution errors to computational abandonment at 4x4 scale.
Benchmarks of 20 small LLMs on a 6GB RTX 4050
A detailed benchmark of 20 small LLMs quantized for a 6GB GPU, measuring speed and VRAM usage at various context lengths, with qualitative probing for tool-use and instruction following. The report aims to help users with modest hardware choose models for local, private automation tasks.
Built a tool that tells you exactly which LLMs fit on your GPU. Feedback wanted.
A tool that estimates which LLMs fit on a user's GPU memory, ranking models by performance while considering memory constraints and quantization levels.