@LigengZhu: Excited to share the KDA: Kernel Design Agents that powers HAN Lab Kernel Mafia top ranking #1~3 kernels at Kernel Cont…
Summary
KDA is an agent-driven kernel design framework that helped HAN Lab achieve top rankings in the MLSys FlashInfer Kernel Contest by minimizing human involvement. The agent leverages Humanize, KernelWiki, and profiler skills to produce state-of-the-art kernels.
Cached at: 05/23/26, 08:16 PM
Excited to share the KDA: Kernel Design Agents that powers HAN Lab Kernel Mafia top ranking #1~3 kernels at Kernel Contest
Thanks to agents, everyone can be a “kernel bro” in 2026: by adapting KDA, the team ranked #1 in MoE, #2 in DSA, and #3 in GDN in the Pure Agent track at the MLSys FlashInfer Kernel Contest, which is notable given that the main participant (Dongyun Zou) has written only ~400 LoC of Triton and 0 lines of CUDA in 2026.
The core philosophy here is to leverage Humanize (the best harness framework) to let the agent run autonomously for as long as possible. By minimizing human involvement and input, and placing full trust in the agent, we can achieve kernel performance that nears SOTA levels.
HAN Lab Mafia Solution to MLSys’26 Kernel Contest: https://github.com/mit-han-lab/mlsys2026-flashinfer-contest… KDA Github: https://github.com/mit-han-lab/kernel-design-agents…
mit-han-lab/mlsys2026-flashinfer-contest
Source: https://github.com/mit-han-lab/mlsys2026-flashinfer-contest
HAN Lab Kernel Mafia MLSys2026 FlashInfer Contest Release
This repository releases the prompts, workflow documentation, and a minimal verification example for our MLSys 2026 FlashInfer Full-Agent track effort. The submitted kernels were produced by a fully agent-driven optimization workflow with KDA (Kernel Design Agents). The core methods are Humanize (the best harness framework), our collected KernelWiki, and Nsight Compute profiling skills.
- Team: HAN Lab Kernel Mafia
- Technical Report: docs/HAN_Lab_Kernel_Mafia_Technical_Report
- Generated Kernels / Final solution: mit-han-lab/mlsys2026-flashinfer-contest-solution

The repository is intentionally small. It contains documentation, agent prompts, and a lightweight flashinfer-bench verification example for an externally packed solution.json. Final kernel source snapshots and the submission verification harness live in the separate submissions repository linked above. The reusable skills remain in their own repositories and are linked below.
Competition Results
Our agent workflow placed in the top three in all three Full-Agent Approach tracks of the MLSys 2026 Competition NVIDIA Track:
| Track | Result |
|---|---|
| MoE Track | 1st place |
| DSA Track | 2nd place |
| GDN Track | 3rd place |
The released prompts follow a three-stage optimization workflow.
[Figure: three-stage optimization workflow]

Each stage uses the Humanize planning and RLCR loop to turn a phase prompt into an executable optimization plan.
[Figure: Humanize planning and RLCR loop]

Skill Ablation
This ablation was run after the competition, separately from the official contest submissions, so its numbers are meant to explain skill contributions rather than exactly match the competition results above. The skill ablation highlights that Humanize is the dominant contributor: it gives the agent a much stronger plan-execute-verify structure, turning each optimization attempt into a more disciplined loop instead of a loose sequence of trials. KernelWiki broadens the kernel knowledge the agent can consult, and ncu-report-skill lets the agent read finer-grained profiler evidence instead of relying only on benchmark scores as a black box. Those two skills are useful, but the largest and most central gain comes from Humanize.
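To make the plan-execute-verify idea concrete, here is a minimal sketch of such a loop. It is not Humanize's implementation: the function name, the `Attempt` record, and the toy plans and numbers are all hypothetical, and the `execute` and `verify` callables stand in for a real benchmark run and correctness check.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    plan: str          # what this step intended to change (hypothetical label)
    latency_ms: float  # measured result of executing the plan
    correct: bool      # whether the candidate passed the correctness check


def plan_execute_verify(
    plans: list[str],
    execute: Callable[[str], float],
    verify: Callable[[str], bool],
    baseline_ms: float,
) -> tuple[Optional[str], float, list[Attempt]]:
    """Accept a candidate only if it is verified correct AND faster than the best so far."""
    best_plan, best_ms = None, baseline_ms
    history: list[Attempt] = []
    for plan in plans:
        latency = execute(plan)   # execute: build and benchmark the candidate
        ok = verify(plan)         # verify: compare outputs against a reference
        history.append(Attempt(plan, latency, ok))
        if ok and latency < best_ms:
            best_plan, best_ms = plan, latency
    return best_plan, best_ms, history


# Toy stand-ins for a real benchmark and correctness check (made-up numbers).
toy_latency = {"tile 64x64": 1.8, "fuse epilogue": 1.2, "skip bounds check": 0.9}
toy_correct = {"tile 64x64": True, "fuse epilogue": True, "skip bounds check": False}

best, ms, log = plan_execute_verify(
    list(toy_latency), toy_latency.__getitem__, toy_correct.__getitem__, baseline_ms=2.0
)
print(best, ms)  # fuse epilogue 1.2
```

The fastest toy candidate is rejected because it fails verification, which is the discipline the loop adds over a loose sequence of trials that tracks benchmark scores alone.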

Contents
| Path | Purpose |
|---|---|
| verify.py | Minimal example that evaluates one packed FlashInfer solution.json with flashinfer-bench. |
| prompts/ | Prompt template and task-specific prompts used for the agent workflow. |
| skills/ | Git submodule links to the required Claude skills. |
| docs/HAN_Lab_Kernel_Mafia_Technical_Report.pdf | Technical report. |
| docs/reproduction.md | Environment, dataset, and benchmark reproduction notes. |
Fresh Workflow Setup
Clone this repository, install the benchmark environment, download the FlashInfer contest workloads, and prepare the agent workflow dependencies:
git clone --recurse-submodules https://github.com/mit-han-lab/mlsys2026-flashinfer-contest.git
cd mlsys2026-flashinfer-contest
git clone https://github.com/flashinfer-ai/flashinfer-bench.git /tmp/flashinfer-bench-main
uv sync --python 3.12
# uv.lock pins the contest-tested stack:
# flashinfer-python==0.6.8.post1, torch==2.12.0+cu132, triton==3.6.0.
# Use Python 3.12 or 3.13; Python 3.14 is not supported by all CUDA wheels.
# Required by some baselines and generated solutions that use DeepGEMM/CUTLASS/CuTe headers.
git clone https://github.com/deepseek-ai/DeepGEMM.git /tmp/DeepGEMM
uv pip install -e /tmp/DeepGEMM --no-build-isolation
uv run ./scripts/download_data.sh
Confirm that the workload dataset is visible:
uv run python -c "from flashinfer_bench import TraceSet; ts = TraceSet.from_path('data/flashinfer-trace'); print(sorted(ts.definitions)); print(sum(len(v) for v in ts.workloads.values()), 'workloads')"
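The one-liner above can also be written as a small script, which is easier to extend. This is a sketch that uses only the calls shown in the one-liner (`TraceSet.from_path`, `ts.definitions`, `ts.workloads`); the helper names `summarize_traces` and `check_dataset` are hypothetical, and the per-key breakdown assumes only that `ts.workloads` is a mapping of sized collections.

```python
from collections.abc import Mapping, Sequence


def summarize_traces(definitions: Mapping, workloads: Mapping[str, Sequence]) -> dict:
    """Summarize a trace set: sorted definition names, per-key and total workload counts."""
    per_key = {key: len(items) for key, items in workloads.items()}
    return {
        "definitions": sorted(definitions),
        "per_key": per_key,
        "total": sum(per_key.values()),
    }


def check_dataset(path: str = "data/flashinfer-trace") -> dict:
    """Load the dataset and fail loudly if no workloads are visible."""
    # Imported lazily so summarize_traces stays usable without flashinfer-bench installed.
    from flashinfer_bench import TraceSet

    ts = TraceSet.from_path(path)
    summary = summarize_traces(ts.definitions, ts.workloads)
    if summary["total"] == 0:
        raise SystemExit(f"no workloads found under {path}")
    return summary
```

Call `check_dataset()` from the repository root inside the `uv` environment; an empty dataset then stops the run instead of printing `0 workloads`.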
Create a separate task implementation workspace from the official FlashInfer starter kit, then start the agent from there. This repository is the prompt/workflow release; do not implement kernels directly in this repository.
mkdir -p workspaces
git clone https://github.com/flashinfer-ai/flashinfer-bench-starter-kit.git workspaces/<task-name>
cd workspaces/<task-name>
export FIB_DATASET_PATH="$OLDPWD/data/flashinfer-trace"
Then choose a task prompt under prompts/, start a fresh agent session in the task implementation workspace, and paste the selected phase prompt. The released final kernels are not part of this workflow and must not be used as implementation input.
See docs/reproduction.md for full environment notes and packed-solution verification commands.
By default, the dataset is stored under data/flashinfer-trace inside this repository. Override it with:
export FIB_DATASET_PATH=/path/to/flashinfer-trace
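The default-plus-override rule can be expressed in a few lines. This is a sketch of the documented behavior, not of how flashinfer-bench reads the variable internally; `resolve_dataset_path` is a hypothetical helper.

```python
import os
from pathlib import Path


def resolve_dataset_path(repo_root, env=None) -> Path:
    """Return FIB_DATASET_PATH if set and non-empty, else <repo_root>/data/flashinfer-trace."""
    env = os.environ if env is None else env
    override = env.get("FIB_DATASET_PATH")
    if override:
        return Path(override).expanduser()
    return Path(repo_root) / "data" / "flashinfer-trace"


# Explicit env dicts keep the example independent of the caller's shell.
print(resolve_dataset_path("/repo", {}))
print(resolve_dataset_path("/repo", {"FIB_DATASET_PATH": "/path/to/flashinfer-trace"}))
```

Passing the environment as an argument makes the fallback logic easy to check without exporting anything in the current shell.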
Agent Workflow Dependencies
The workflow depends on Claude Code and Codex. Install humanize as a Claude Code plugin, and install KernelWiki and ncu-report-skill as Claude skills under ~/.claude/skills/.
This repository links the two required skills as git submodules under skills/ so they are visible in the release tree. If you did not clone with --recurse-submodules, initialize them from the repository root:
git submodule update --init --recursive
# link skills
mkdir -p ~/.claude/skills
ln -sfn "$PWD/skills/KernelWiki" ~/.claude/skills/KernelWiki
ln -sfn "$PWD/skills/ncu-report-skill" ~/.claude/skills/ncu-report-skill
# or clone skills directly
# mkdir -p ~/.claude/skills && cd ~/.claude/skills
# git clone https://github.com/DongyunZou/ncu-report-skill.git
# git clone https://github.com/DongyunZou/KernelWiki.git
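Before starting an agent session, it can help to confirm that both skills resolve under the skills directory, since a symlink into an uninitialized submodule is easy to miss. The check below is a sketch; `missing_skills` is a hypothetical helper, and the demo runs against a temporary directory rather than the real `~/.claude/skills`.

```python
import tempfile
from pathlib import Path

REQUIRED_SKILLS = ("KernelWiki", "ncu-report-skill")


def missing_skills(skills_dir, required=REQUIRED_SKILLS) -> list[str]:
    """Return the required skill names that do not resolve to a directory."""
    root = Path(skills_dir).expanduser()
    # is_dir() follows symlinks, so a dangling link is reported as missing.
    return [name for name in required if not (root / name).is_dir()]


# Demo on a throwaway directory that contains only one of the two skills.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "KernelWiki").mkdir()
    demo_missing = missing_skills(tmp)

print(demo_missing)  # ['ncu-report-skill']
```

For a real setup, call `missing_skills("~/.claude/skills")`; an empty list means both skills are in place.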
Install humanize separately from the Claude plugin marketplace:
# Add PolyArch marketplace
/plugin marketplace add PolyArch/humanize
# Then install humanize plugin
/plugin install humanize@PolyArch
Release Boundary
Final kernels are stored only in mit-han-lab/mlsys2026-flashinfer-contest-solution.git as result snapshots. This link is for release provenance and final-result verification; it is not an input to the prompt-driven agent workflow. Agents MUST NOT clone or inspect the release repository while solving the tasks. Intermediate candidates, benchmark histories, and search DAGs are not part of this release. The prompts in prompts/ are meant to be run from a separate task implementation workspace created from the official FlashInfer starter kit. We do not place final kernels inside an agent starting workspace.
Running the full agent workflow is not bitwise deterministic: search order, profiling noise, GPU scheduling, and model behavior can all vary between runs. The external submissions repository is the source of truth for the released final kernel snapshots.