AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
Summary
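AccelOpt is a self-improving LLM agentic system that autonomously optimizes AI accelerator kernels through iterative generation and an optimization memory. On NKIBench kernels it raises the average percentage of peak throughput from 49% to 61% on AWS Trainium 1 and from 45% to 59% on Trainium 2. Using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being 26x cheaper.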
Cached at: 04/20/26, 08:26 AM
Source: https://huggingface.co/papers/2511.15915
Abstract
AccelOpt is a self-improving LLM agentic system that autonomously optimizes kernels for AI accelerators using iterative generation and optimization memory, achieving significant throughput improvements at reduced costs.
We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt’s capability improves over time, boosting the average percentage of peak throughput from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being 26 times cheaper. The code is open-sourced at https://github.com/zhang677/AccelOpt.
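The loop the abstract describes (iterative generation guided by a memory of slow-fast kernel pairs) can be illustrated with a toy sketch. This is not AccelOpt's implementation. Every name here (`Experience`, `OptimizationMemory`, `propose`, `toy_latency`), the memory capacity, and the made-up cost model are assumptions for illustration only; a random mutation stands in for the LLM, and a formula stands in for profiling on hardware.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Experience:
    """One slow-fast pair plus a short note on what changed (hypothetical schema)."""
    slow: dict
    fast: dict
    speedup: float
    insight: str


@dataclass
class OptimizationMemory:
    """Keeps the top experiences by speedup (capacity is an assumed parameter)."""
    capacity: int = 4
    items: list = field(default_factory=list)

    def add(self, exp: Experience) -> None:
        self.items.append(exp)
        self.items.sort(key=lambda e: e.speedup, reverse=True)
        del self.items[self.capacity:]


def toy_latency(kernel: dict) -> float:
    """Stand-in for profiling: a made-up cost model that is minimal at tile=128."""
    return abs(kernel["tile"] - 128) + 10.0


def propose(kernel: dict, memory: OptimizationMemory, rng: random.Random) -> dict:
    """Stand-in for the LLM: mutate the tile size, biased by the best remembered pair."""
    step = rng.choice([-32, -16, 16, 32])
    if memory.items:
        best = memory.items[0]
        direction = 1 if best.fast["tile"] > best.slow["tile"] else -1
        step = abs(step) * direction
    return {"tile": max(16, kernel["tile"] + step)}


def optimize(baseline: dict, iterations: int = 20, candidates: int = 4, seed: int = 0):
    """Generate candidates each iteration; record every strict improvement in memory."""
    rng = random.Random(seed)
    memory = OptimizationMemory()
    best, best_lat = baseline, toy_latency(baseline)
    for _ in range(iterations):
        for _ in range(candidates):
            cand = propose(best, memory, rng)
            lat = toy_latency(cand)
            if lat < best_lat:
                note = f"tile {best['tile']} -> {cand['tile']}"
                memory.add(Experience(best, cand, best_lat / lat, note))
                best, best_lat = cand, lat
    return best, best_lat, memory
```

Calling `optimize({"tile": 16})` returns the best toy kernel found, its toy latency, and the memory of recorded slow-fast pairs. The point of the sketch is only the feedback path: each accepted improvement becomes a stored experience that shapes later proposals.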
View arXiv page (https://arxiv.org/abs/2511.15915)
View PDF (https://arxiv.org/pdf/2511.15915)
Project page (https://ppl.stanford.edu/accelopt.html)
GitHub (https://github.com/zhang677/AccelOpt)
Community
Comment from the paper author and submitter, posted about 3 hours ago (https://huggingface.co/papers/2511.15915#69e5b6764506644887a3196b):
- AccelOpt boosts the average percentage of peak throughput from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2 for NKIBench kernels.
- AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being 26x cheaper.
- AccelOpt is agnostic to kernel languages. On 24 Triton kernels from FlashInfer-Bench (H100), AccelOpt with gpt-oss-120b achieved a 1.27x average speedup over the best Triton baselines, with a 3.19x peak speedup on a GQA decoding kernel. This adaptation took the first author 3 days.
- In Stanford CS149 Fall 2025, a graduate-level parallel computing course, AccelOpt optimized a Conv2D kernel outside of NKIBench and achieved 48.8% of peak throughput, starting from last year’s reference implementation (9.54% of peak throughput). Based on the optimization proposed by AccelOpt, we designed an extra credit problem, which 33.6% of the 131 student teams solved.
- The AccelOpt paper was accepted to MLSys 2026.
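The percentages in the list above mix absolute and relative quantities, so a short arithmetic check may help. The sketch below uses only the figures reported in the comment; the helper name `relative_gain` is an assumption for illustration. Reading the before/after ratio as a speedup assumes the peak throughput and the workload are unchanged between the two measurements.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


# (before, after) percentages of peak throughput, as reported in the comment above.
reported = {
    "Trainium 1 (avg % of peak)": (49.0, 61.0),
    "Trainium 2 (avg % of peak)": (45.0, 59.0),
    "CS149 Conv2D (% of peak)": (9.54, 48.8),
}

for name, (before, after) in reported.items():
    print(f"{name}: +{after - before:.2f} points, "
          f"{relative_gain(before, after):.1%} relative, {after / before:.2f}x")

# 33.6% of 131 teams, rounded to a whole number of teams.
teams_solved = round(0.336 * 131)
print(f"Teams that solved the extra credit problem: about {teams_solved} of 131")
```

The 12-point gain on Trainium 1 is thus about a 24.5% relative improvement, the 14-point gain on Trainium 2 about 31.1%, and the Conv2D result is roughly 5.1x the reference implementation's fraction of peak.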
main-method-shaowz (https://cdn-uploads.huggingface.co/production/uploads/65a76ff1e504d9738d636217/Bolv_a2d6tBc4ldX9wo3U.png)
Get this paper in your agent:
hf papers read 2511.15915
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 3
Genghan/sft-qwen-7b-instruct_GRPO_nki_pure_0920_cluster3 • 8B • Updated about 3 hours ago • 2 (https://huggingface.co/Genghan/sft-qwen-7b-instruct_GRPO_nki_pure_0920_cluster3)
Genghan/deepseek-coder-33b-instruct_GRPO_nki_pure_0907_cluster1 • 33B • Updated about 3 hours ago • 1 (https://huggingface.co/Genghan/deepseek-coder-33b-instruct_GRPO_nki_pure_0907_cluster1)
Genghan/sft-deepseek-coder-33b-instruct_GRPO_nki_pure_0921_cluster4 • 33B • Updated about 3 hours ago • 2 (https://huggingface.co/Genghan/sft-deepseek-coder-33b-instruct_GRPO_nki_pure_0921_cluster4)
Datasets citing this paper: 1
Genghan/NKIBench • Updated about 3 hours ago • 21 (https://huggingface.co/datasets/Genghan/NKIBench)
Similar Articles
KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators
KForge is a cross-platform framework that uses two collaborating LLM-based agents to automatically generate and optimize high-performance compute kernels for diverse AI accelerators, achieving significant speedups on NVIDIA B200 and Intel Arc B580 hardware.
@AlphaSignalAI: You can now boost any LLM's accuracy 2-10x without training it. Most teams improve model accuracy by fine-tuning or swa…
OptiLLM is an open-source proxy that boosts any LLM's accuracy 2-10x by adding extra compute at inference time, using techniques like multi-agent cross-verification and Monte Carlo tree search.
TokenSpeed: A Speed-of-Light LLM Inference Engine for Agentic Workloads (5 minute read)
Lightseek releases TokenSpeed, a high-performance LLM inference engine optimized for agentic workloads, featuring compiler-backed parallelism and advanced kernel optimizations that have been adopted by vLLM.
We stopped optimizing our LLM stack manually — it optimizes itself now
The article describes a company's transition to a self-optimizing LLM stack that uses production traces to automatically route requests and fine-tune models, resulting in significant cost reductions and performance improvements.
Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference
This paper introduces Ada-MK, an adaptive MegaKernel optimization method that uses automated DAG-based search to eliminate runtime branching and reduce shared memory usage for LLM inference. It demonstrates significant throughput improvements on NVIDIA Ada GPUs by integrating with TensorRT-LLM, achieving up to 23.6% faster performance than vanilla TensorRT-LLM in commercial advertising systems.