AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
Summary
AccelOpt is a self-improving LLM agentic system that autonomously optimizes AI accelerator kernels through iterative generation guided by an optimization memory, raising NKIBench kernels from 49% to 61% of peak throughput on AWS Trainium while, with open-source models, being 26x cheaper than Claude Sonnet 4.
Source: https://huggingface.co/papers/2511.15915
Abstract
AccelOpt is a self-improving LLM agentic system that autonomously optimizes kernels for AI accelerators using iterative generation and an optimization memory, achieving significant throughput improvements at a fraction of the cost of proprietary models.
We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt’s capability improves over time, boosting the average percentage of peak throughput from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being 26 times cheaper. The code is open-sourced at https://github.com/zhang677/AccelOpt.
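The loop the abstract describes, iterative candidate generation steered by a memory of slow-fast kernel pairs, can be sketched as follows. This is a minimal illustration under assumed names, not the authors' implementation: OptimizationMemory, propose_kernel, and benchmark are hypothetical stand-ins for the memory component, the LLM proposal step, and the on-device throughput measurement.

from dataclasses import dataclass, field

@dataclass
class OptimizationMemory:
    # Curates experiences from previously seen slow-fast kernel pairs,
    # mirroring the "optimization memory" described in the abstract.
    insights: list = field(default_factory=list)

    def add(self, slow_src: str, fast_src: str, note: str) -> None:
        self.insights.append((slow_src, fast_src, note))

    def render(self, k: int = 5) -> str:
        # Surface the k most recent insights as prompt context.
        return "\n".join(note for _, _, note in self.insights[-k:])

def optimize(kernel_src, benchmark, propose_kernel, steps=10):
    # benchmark(src) -> measured throughput; propose_kernel(src, context) -> new src.
    memory = OptimizationMemory()
    best_src, best_tput = kernel_src, benchmark(kernel_src)
    for _ in range(steps):
        # The LLM proposes a candidate, conditioned on accumulated insights.
        candidate = propose_kernel(best_src, context=memory.render())
        tput = benchmark(candidate)
        if tput > best_tput:
            # Record the slow-fast pair so later proposals can reuse the insight.
            memory.add(best_src, candidate, f"speedup {tput / best_tput:.2f}x")
            best_src, best_tput = candidate, tput
    return best_src, best_tput

The key design point this sketch captures is that the memory, not expert-provided hardware knowledge, supplies the optimization context for each new proposal, which is why the system's capability can improve over time.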
arXiv page: https://arxiv.org/abs/2511.15915 • PDF: https://arxiv.org/pdf/2511.15915 • Project page: https://ppl.stanford.edu/accelopt.html • GitHub (33): https://github.com/zhang677/AccelOpt
Community
Comment from the paper author and submitter:
- AccelOpt boosts the average percentage of peak throughput from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2 for NKIBench kernels.
- AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being 26x cheaper.
- AccelOpt is agnostic to kernel languages. On 24 Triton kernels from FlashInfer-Bench (H100), AccelOpt with gpt-oss-120b achieved a 1.27x average speedup over the best Triton baselines, with a 3.19x peak speedup on a GQA decoding kernel. Porting AccelOpt to this setting took the first author 3 days.
- In Stanford CS149 (Fall 2025), a graduate-level parallel computing course, AccelOpt optimized a Conv2D kernel outside of NKIBench, raising it from 9.54% of peak throughput (last year’s reference implementation) to 48.8%. Based on the optimization AccelOpt proposed, we designed an extra-credit problem, which 33.6% of the 131 student teams solved.
- The AccelOpt paper was accepted to MLSys 2026.
Figure: AccelOpt main method overview (https://cdn-uploads.huggingface.co/production/uploads/65a76ff1e504d9738d636217/Bolv_a2d6tBc4ldX9wo3U.png)
Get this paper in your agent:
hf papers read 2511.15915
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper (3)
Genghan/sft-qwen-7b-instruct_GRPO_nki_pure_0920_cluster3 • 8B • Updated about 3 hours ago • 2 (https://huggingface.co/Genghan/sft-qwen-7b-instruct_GRPO_nki_pure_0920_cluster3)
Genghan/deepseek-coder-33b-instruct_GRPO_nki_pure_0907_cluster1 • 33B • Updated about 3 hours ago • 1 (https://huggingface.co/Genghan/deepseek-coder-33b-instruct_GRPO_nki_pure_0907_cluster1)
Genghan/sft-deepseek-coder-33b-instruct_GRPO_nki_pure_0921_cluster4 • 33B • Updated about 3 hours ago • 2 (https://huggingface.co/Genghan/sft-deepseek-coder-33b-instruct_GRPO_nki_pure_0921_cluster4)
Datasets citing this paper (1)
Genghan/NKIBench • Updated about 3 hours ago • 21 (https://huggingface.co/datasets/Genghan/NKIBench)
Similar Articles
TokenSpeed: A Speed-of-Light LLM Inference Engine for Agentic Workloads (5 minute read)
Lightseek releases TokenSpeed, a high-performance LLM inference engine optimized for agentic workloads, featuring compiler-backed parallelism and advanced kernel optimizations that have been adopted by vLLM.
I've created the fastest local AI engine for Apple Silicon. Optimised for agentic use.
The author announces the release of 'lightning-mlx', a local AI engine optimized for Apple Silicon that achieves high token speeds for coding agents and tool-calling workflows.
AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
Researchers from Carnegie Mellon, the University of Washington, and Arm propose AdaExplore, an LLM agent framework for GPU kernel code generation that achieves 3.12x and 1.72x speedups on KernelBench Level-2 and Level-3 benchmarks through failure-driven adaptation and diversity-preserving search, without additional fine-tuning.
Self-Evolving LLM Memory Extraction Across Heterogeneous Tasks
Researchers introduce BEHEMOTH benchmark and CluE cluster-based prompt optimization to enable LLMs to extract and retain heterogeneous memory across diverse tasks, achieving 9% gains over prior self-evolving frameworks.
AMD AI ENGAGE
The article discusses the AMD AI Engage Program, a community initiative for AI developers offering prizes, credits, and networking opportunities for building LLM apps and GenAI workflows.