AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents
Summary
AgentKernelArena is an open-source benchmark for evaluating AI coding agents on GPU kernel optimization, assessing full agent workflows and generalization to unseen configurations across 196 tasks.
Source: https://huggingface.co/papers/2605.16819
Abstract
AgentKernelArena is introduced as an open-source benchmark for evaluating AI coding agents on GPU kernel optimization, assessing full agent workflows and unseen-configuration generalization across multiple optimization tasks.
GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers and profilers, and refine implementations, yet existing kernel benchmarks evaluate single LLM calls rather than full agent workflows, and none include both kernel-to-kernel optimization and unseen-configuration generalization testing. We present AgentKernelArena, an open-source benchmark for measuring AI coding agents on GPU kernel optimization. The benchmark contains 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation, and evaluates complete agent workflows in isolated workspaces using gated compilation, correctness, and performance checks, centralized scoring, and an unseen-configuration generalization protocol that tests whether optimizations transfer to input configurations the agent never observed. Across production agents including Cursor Agent, Claude Code, and Codex Agent, we find near-perfect compilation and high correctness rates on most task categories, with the strongest configurations achieving mean speedups of up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. Our unseen-configuration evaluation shows that HIP-to-HIP and Triton-to-Triton optimizations largely transfer to unseen input shapes, while PyTorch-to-HIP exhibits substantial correctness drops, indicating that agents generating kernels from scratch frequently hardcode shape-specific assumptions. AgentKernelArena is designed as a modular, extensible framework for rigorous evaluation of agentic GPU kernel optimization across agents, tasks, and hardware targets.
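The unseen-configuration idea in the abstract can be illustrated with a toy sketch in plain Python. This is not the benchmark's harness: the "kernels" are ordinary functions, and every name here (`reference_row_sum`, `candidate_row_sum`, `correctness_rate`, the shape lists) is a hypothetical stand-in chosen for illustration. The candidate deliberately hardcodes a column count, so it passes on the shapes it was tuned against and fails on most held-out ones.

```python
def reference_row_sum(matrix):
    """Reference implementation: sums each row for any number of columns."""
    return [sum(row) for row in matrix]


def candidate_row_sum(matrix):
    """Hypothetical 'optimized' version that silently assumes exactly 4 columns."""
    return [row[0] + row[1] + row[2] + row[3] for row in matrix]


def make_input(rows, cols):
    """Deterministic test matrix of the requested shape."""
    return [[float(r * cols + c) for c in range(cols)] for r in range(rows)]


def correctness_rate(candidate, reference, shapes):
    """Fraction of shapes on which the candidate matches the reference."""
    passed = 0
    for rows, cols in shapes:
        data = make_input(rows, cols)
        try:
            ok = candidate(data) == reference(data)
        except IndexError:
            # A crash on an unexpected shape counts as a correctness failure.
            ok = False
        passed += int(ok)
    return passed / len(shapes)


# Shapes assumed visible to the agent during optimization (illustrative).
SEEN_SHAPES = [(2, 4), (8, 4)]
# Held-out shapes assumed never shown to the agent (illustrative).
UNSEEN_SHAPES = [(3, 6), (5, 2), (16, 4)]

seen_rate = correctness_rate(candidate_row_sum, reference_row_sum, SEEN_SHAPES)
unseen_rate = correctness_rate(candidate_row_sum, reference_row_sum, UNSEEN_SHAPES)
print(f"seen correctness:   {seen_rate:.2f}")
print(f"unseen correctness: {unseen_rate:.2f}")
```

Comparing the two rates is the point of the sketch: a gap between seen and unseen correctness is the kind of signal the abstract attributes to hardcoded shape assumptions.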
Similar Articles
AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
Researchers from Carnegie Mellon, University of Washington, and Arm propose AdaExplore, an LLM agent framework for GPU kernel code generation that achieves 3.12× and 1.72× speedups on KernelBench Level-2 and Level-3 benchmarks through failure-driven adaptation and diversity-preserving search, without additional fine-tuning.
KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels
KernelBench-X is a new benchmark for evaluating LLM-generated GPU kernels, revealing that task structure impacts correctness more than method design and that correctness does not guarantee hardware efficiency.
AA introduces Coding Agent Index - Performance Comparisons between Model & Harness Combinations
Artificial Analysis introduces the Coding Agent Index, a new benchmark suite combining SWE-Bench-Pro-Hard-AA, Terminal-Bench v2, and SWE-Atlas-QnA to evaluate the performance of AI coding agents across diverse tasks.
There is no benchmark for the agent that merged your pull request.
Artificial Analysis launched a coding agent index that tests harness and model combinations separately, highlighting that benchmark tasks differ from real production needs. The article argues that teams should evaluate agent configurations on their own codebases and workflows rather than relying solely on standardized benchmarks.
KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators
KForge is a cross-platform framework that uses two collaborating LLM-based agents to automatically generate and optimize high-performance compute kernels for diverse AI accelerators, achieving significant speedups on NVIDIA B200 and Intel Arc B580 hardware.