AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization

Hugging Face Daily Papers

Summary

AccelOpt is a self-improving LLM agentic system that autonomously optimizes AI accelerator kernels through iterative generation guided by an optimization memory, raising kernels from 49% to 61% of peak throughput on AWS Trainium while being 26x cheaper than Claude Sonnet 4.

We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being 26 times cheaper. The code is open-sourced at https://github.com/zhang677/AccelOpt.
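The iterative loop described in the abstract, in which an LLM proposes kernel rewrites and an optimization memory curates slow-fast kernel pairs, can be sketched as follows. All names here are illustrative assumptions, not the actual AccelOpt API:

```python
# Hypothetical sketch of an AccelOpt-style self-improving loop; the real
# system's memory curation and prompting are more sophisticated.

class Memory:
    """Curates experiences from previously seen slow -> fast kernel pairs."""
    def __init__(self):
        self.pairs = []

    def retrieve(self, src):
        # Return insights relevant to this kernel (here: all recorded pairs).
        return self.pairs

    def add(self, slow, fast):
        self.pairs.append((slow, fast))

def optimize_kernel(src, rewrite, benchmark, memory, iterations=10):
    """Keep the fastest kernel variant found across LLM rewrites.

    rewrite(src, hints) stands in for an LLM call; benchmark(src) returns
    measured latency on the accelerator (lower is better).
    """
    best_src, best_time = src, benchmark(src)
    for _ in range(iterations):
        hints = memory.retrieve(best_src)      # insights from past pairs
        candidate = rewrite(best_src, hints)   # LLM proposes a rewrite
        t = benchmark(candidate)
        if t < best_time:                      # found a faster kernel:
            memory.add(slow=best_src, fast=candidate)
            best_src, best_time = candidate, t
    return best_src, best_time
```

Because the memory persists across kernels and iterations, later optimization runs can reuse insights from earlier slow-fast pairs, which is what allows the system's capability to improve over time.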



Source: https://huggingface.co/papers/2511.15915


  • arXiv page: https://arxiv.org/abs/2511.15915
  • PDF: https://arxiv.org/pdf/2511.15915
  • Project page: https://ppl.stanford.edu/accelopt.html
  • GitHub: https://github.com/zhang677/AccelOpt

Community

Paper author and submitter, posted and edited about 3 hours ago (https://huggingface.co/papers/2511.15915#69e5b6764506644887a3196b):

  • AccelOpt boosts the average percentage of peak throughput from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2 for NKIBench kernels.
  • AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being 26x cheaper.
  • AccelOpt is agnostic to kernel languages. On 24 Triton kernels from FlashInfer-Bench (H100), AccelOpt with gpt-oss-120b achieved a 1.27x average speedup over the best Triton baselines, with a 3.19x peak speedup on a GQA decoding kernel. Porting AccelOpt to this setting took the first author three days.
  • In Stanford CS149 Fall 2025, a graduate-level parallel computing course, AccelOpt optimized a Conv2D kernel outside of NKIBench to 48.8% of peak throughput, starting from last year's reference implementation (9.54%). Based on the optimization proposed by AccelOpt, we designed an extra credit problem, which 33.6% of the 131 student teams solved.
  • The AccelOpt paper was accepted to MLSys 2026.
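The two metrics quoted in the bullets above can be related by simple arithmetic. These are generic definitions, assumed for illustration rather than taken from the paper's methodology:

```python
# Illustrative arithmetic for "percentage of peak throughput" and speedup;
# generic definitions, not necessarily the paper's exact measurement setup.

def pct_of_peak(achieved, peak):
    """Achieved throughput as a percentage of the hardware peak."""
    return 100.0 * achieved / peak

def speedup_from_pct(pct_before, pct_after):
    """Moving from one fraction of peak to another on the same workload
    and hardware implies this end-to-end speedup."""
    return pct_after / pct_before

# e.g. Trainium 1 going from 49% to 61% of peak is roughly a 1.24x speedup.
```

This is why a modest-looking gain in percent-of-peak (49% to 61%) still corresponds to a meaningful wall-clock improvement on every affected kernel.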

Figure: AccelOpt main method overview (https://cdn-uploads.huggingface.co/production/uploads/65a76ff1e504d9738d636217/Bolv_a2d6tBc4ldX9wo3U.png)


Get this paper in your agent:

    hf papers read 2511.15915

Don't have the latest CLI? Install it with:

    curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper (3)

Genghan/sft-qwen-7b-instruct_GRPO_nki_pure_0920_cluster3 (8B) • Updated about 3 hours ago (https://huggingface.co/Genghan/sft-qwen-7b-instruct_GRPO_nki_pure_0920_cluster3)

Genghan/deepseek-coder-33b-instruct_GRPO_nki_pure_0907_cluster1 (33B) • Updated about 3 hours ago (https://huggingface.co/Genghan/deepseek-coder-33b-instruct_GRPO_nki_pure_0907_cluster1)

Genghan/sft-deepseek-coder-33b-instruct_GRPO_nki_pure_0921_cluster4 (33B) • Updated about 3 hours ago (https://huggingface.co/Genghan/sft-deepseek-coder-33b-instruct_GRPO_nki_pure_0921_cluster4)

Datasets citing this paper (1)

Genghan/NKIBench • Updated about 3 hours ago (https://huggingface.co/datasets/Genghan/NKIBench)

