KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

Hugging Face Daily Papers 05/06/26, 12:00 AM Papers

benchmark llm-code-generation gpu-kernels triton evaluation hardware-efficiency

Summary

KernelBench-X is a new benchmark for evaluating LLM-generated GPU kernels, revealing that task structure impacts correctness more than method design and that correctness does not guarantee hardware efficiency.

LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability break down, and why? We present KernelBench-X, a benchmark designed to answer this question through category-aware evaluation of correctness and hardware efficiency across 176 tasks in 15 categories. Our systematic comparison of five representative methods yields three main findings. First, task structure determines correctness more than method design. Category explains nearly three times more variance in semantic correctness than method (9.4% vs 3.3% explained deviance), and 72% of Fusion tasks fail across all five methods while Math tasks are solved consistently. Second, iterative refinement improves correctness, but not performance. Across GEAK iterations, compile rate rises from 52.3% to 68.8% while average speedup declines from 1.58× to 1.44×; newly rescued kernels consistently underperform persistently correct ones (1.16× vs 1.58× speedup from round 0 to round 1). Third, correctness does not imply efficiency. 46.6% of correct kernels are slower than the PyTorch eager baseline, and cross-hardware speedup variance reaches 21.4×. In addition, quantization remains completely unsolved (0/30 successes) despite non-trivial compilation rates, revealing a systematic misunderstanding of numerical computation contracts rather than surface-level syntax errors. These findings suggest that future progress depends on handling global coordination, explicitly modeling numerical precision, and incorporating hardware efficiency into generation. The code is available at https://github.com/BonnieW05/KernelBenchX
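The compile-rate, speedup, and "correct but slower" figures can be made concrete with a small sketch. It assumes speedup means baseline latency divided by generated-kernel latency, so a value below 1 means the kernel is slower than PyTorch eager; the abstract does not spell out the formula. The record layout, function names, and timings are hypothetical placeholders, not the benchmark's data format or results.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class KernelResult:
    """Outcome of one generated kernel on one task (hypothetical layout)."""
    task: str
    compiled: bool
    correct: bool
    kernel_ms: Optional[float]  # generated-kernel latency; None if it never ran
    baseline_ms: float          # reference latency, e.g. PyTorch eager


def speedup(result: KernelResult) -> float:
    # Assumed definition: values above 1.0 mean the kernel beats the baseline.
    return result.baseline_ms / result.kernel_ms


def summarize(results):
    # Only kernels that are correct and have a timing contribute to speedup.
    correct = [r for r in results if r.correct and r.kernel_ms]
    speedups = [speedup(r) for r in correct]
    return {
        "compile_rate": sum(r.compiled for r in results) / len(results),
        "correct_rate": len(correct) / len(results),
        "mean_speedup": mean(speedups) if speedups else None,
        "slower_than_baseline": (
            sum(s < 1.0 for s in speedups) / len(speedups) if speedups else None
        ),
    }


# Made-up timings for illustration only.
results = [
    KernelResult("softmax", True, True, 0.5, 1.0),      # faster than baseline
    KernelResult("layernorm", True, True, 2.0, 1.0),    # correct but slower
    KernelResult("fused_attn", True, False, 1.0, 1.0),  # compiles, wrong output
    KernelResult("int8_quant", False, False, None, 1.0),
]
summary = summarize(results)
print(summary)
```

Reporting the share of correct kernels below 1.0 alongside the mean matters because a mean speedup above 1 can hide a large group of kernels that are slower than the baseline.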

Original Article

Cached at: 05/08/26, 10:53 AM

Source: https://huggingface.co/papers/2605.04956

Abstract

The KernelBench-X benchmark shows that task structure affects the correctness of LLM-generated Triton kernels more than method design does, that iterative refinement improves correctness at the expense of performance, and that correctness does not guarantee efficiency.
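A category-aware comparison of this kind rests on per-category tallies, such as the share of tasks in a category that no method solves. A minimal sketch of that tally follows. The input layout (a mapping from category, task, and method to a pass/fail flag), the function name, and the toy entries are assumptions for illustration, not the benchmark's actual data format or results.

```python
from collections import defaultdict


def all_methods_fail_rate(outcomes):
    """Per category, the fraction of tasks on which every method failed.

    `outcomes` maps (category, task, method) -> bool, where True means the
    generated kernel was semantically correct.
    """
    # For each (category, task), record whether at least one method passed.
    any_pass = defaultdict(bool)
    for (category, task, _method), ok in outcomes.items():
        any_pass[(category, task)] |= ok

    tasks_per_category = defaultdict(int)
    unsolved_per_category = defaultdict(int)
    for (category, _task), solved in any_pass.items():
        tasks_per_category[category] += 1
        unsolved_per_category[category] += not solved

    return {
        category: unsolved_per_category[category] / total
        for category, total in tasks_per_category.items()
    }


# Toy outcomes with two hypothetical methods "A" and "B".
outcomes = {
    ("Fusion", "f1", "A"): False, ("Fusion", "f1", "B"): False,
    ("Fusion", "f2", "A"): False, ("Fusion", "f2", "B"): True,
    ("Math", "m1", "A"): True, ("Math", "m1", "B"): True,
}
rates = all_methods_fail_rate(outcomes)
print(rates)
```

A high value for one category combined with a low value for another indicates that the task category, not the choice of method, separates solved from unsolved tasks.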

Similar Articles

AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization

LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning

Benchmarks of 20 small LLMs on a 6GB RTX 4050

Built a tool that tells you exactly which LLMs fit on your GPU. Feedback wanted.
