@Akashi203: i open-sourced automegakernel -- compiles any huggingface model into a single persistent megakernel batch-1 decode is b…

X AI KOLs Timeline Tools

Summary

AutoMegaKernel is an open-source agent harness that compiles any HuggingFace model into a single persistent megakernel, fusing the entire forward pass into one GPU launch to reduce overhead. It achieves up to 1.33x speedup over CUDA-graphed cuBLAS on inference-class GPUs like L4 and L40S, while proving schedules deadlock- and race-free.

i open-sourced automegakernel -- compiles any huggingface model into a single persistent megakernel batch-1 decode is bandwidth-bound. normal execution launches one kernel per op and round-trips activations through HBM dozens of times a layer. that overhead is the whole problem automegakernel fuses the entire forward pass into one launch. one launch = one forward = one token the hard part is a single kernel across every SM synced only by counters is a deadlock/race minefield. so the core piece is a static validator that proves any schedule deadlock-free + race-free before launch. an agent can edit the schedule freely and can't ship a hanging kernel. 7160 adversarial schedules, 6091 unsafe, zero false accepts one source retargets sm_80 / sm_90 / sm_120. reproduces huggingface greedy decode token-for-token on real smollm2-135m search-found int8 megakernel beats cuda-graphed cuBLAS bf16 at batch-1: L4 up to 1.33x L40S 1.25-1.27x. it loses on A100/H100 and we say so llama-family only for now:p sc: http://github.com/RightNow-AI/AutoMegaKernel… paper: https://arxiv.org/abs/2606.09682
Original Article
View Cached Full Text

Cached at: 06/18/26, 10:13 AM

i open-sourced automegakernel – compiles any huggingface model into a single persistent megakernel

batch-1 decode is bandwidth-bound. normal execution launches one kernel per op and round-trips activations through HBM dozens of times a layer. that overhead is the whole problem

automegakernel fuses the entire forward pass into one launch. one launch = one forward = one token

the hard part is a single kernel across every SM synced only by counters is a deadlock/race minefield. so the core piece is a static validator that proves any schedule deadlock-free + race-free before launch. an agent can edit the schedule freely and can’t ship a hanging kernel. 7160 adversarial schedules, 6091 unsafe, zero false accepts

one source retargets sm_80 / sm_90 / sm_120. reproduces huggingface greedy decode token-for-token on real smollm2-135m

search-found int8 megakernel beats cuda-graphed cuBLAS bf16 at batch-1: L4 up to 1.33x L40S 1.25-1.27x. it loses on A100/H100 and we say so

llama-family only for now:p

sc: http://github.com/RightNow-AI/AutoMegaKernel… paper: https://arxiv.org/abs/2606.09682


RightNow-AI/AutoMegaKernel

Source: https://github.com/RightNow-AI/AutoMegaKernel

AutoMegaKernel: a statically-checked agent harness for self-retargeting megakernel synthesis

AutoMegaKernel (AMK)

A general agent harness for GPU megakernel synthesis. A coding agent (Claude Code / Codex) drives AMK to compile a model into one provably-correct, self-retargeting megakernel, the whole forward pass fused into a single persistent launch, then self-tunes it, and it gets better every time it runs.

Repo: https://github.com/RightNow-AI/AutoMegaKernel  ·  License: MIT  ·  the open research harness (Enterprise / Forge)

amk compile <model> --gpu <arch> --regime single-stream
# imports -> lowers -> validates (deadlock + race free) -> verifies vs eager -> builds the GPU
# megakernel -> measures it against the HBM roofline -> emits a correct megakernel + report

AMK is the sibling of AutoKernel: AutoKernel auto-generates the single best kernel; AMK auto-generates the single best whole-model megakernel. It inherits AutoKernel’s autoresearch loop (propose → fixed eval → keep/revert, for hours, unattended) and adds a new search axis: the schedule.

Coverage today: the HuggingFace Llama family on CUDA (sm_75 to sm_120). Roadmap (future work): generalizing the importer and backends to more architectures, languages, and targets. The harness (validator, oracle, search loop) is not model-specific; the agent is what supplies that generality, and broadening it is the central direction of the work.


Results: int8 beats cuBLAS across the inference fleet

AMK’s auto-tuned int8 (W8A16, near-lossless) megakernel beats CUDA-graphed cuBLAS bf16 at batch-1 decode across NVIDIA’s datacenter inference-class GPUs, found autonomously by AMK’s own search and correctness-gated (argmax-exact). Ratio = cuBLAS / AMK, so > 1 means AMK is faster.

GPUClassint8 vs cuBLAS bf16 (best)Verdict
L4 (sm_89, 300 GB/s)inference1.18× → 1.33× (1.3B→4B)✅ wins, grows with size
L40S (sm_89, 864 GB/s)inference flagship1.25× → 1.27× (4B→6.7B)✅ wins
A10G (sm_86, 600 GB/s)inference1.04× → 1.08× (≥3.5B)✅ wins at scale
RTX 5090 (sm_120)consumer (local)1.19× → 1.23×✅ wins
A100 (sm_80, 1.4 TB/s)training0.79× → 0.55× (1.3B→13B)✗ trails (declines w/ size)
H100 (sm_90, 3.1 TB/s)training0.72× → 0.60×✗ trails

Inference / datacenter GPUs measured on Modal (reproducible; file-backed in paper/results/int8_scale_*.json). RTX 5090 measured locally (int8_search_multisize.json; Modal has no RTX 5090 silicon, so it is not in the Modal sweep).

What governs the win is the inference-class vs. training-class regime, not bandwidth ordering: the 864 GB/s L40S wins by more than the 600 GB/s A10G. AMK reads ~half the weight bytes (int8), and that advantage has to overcome a fixed per-tile cross-SM sync; larger, GEMV-dominated models amortize that fixed cost. The training-class A100/H100 never cross (the ratio declines with size, the fingerprint of the sync deficit; a cp.async int8-GEMV probe even regressed to 0.82× on A100, confirming cross-SM sync, not load latency, is the binder). The win is batch-1, pos-0 / low-context. Full data and analysis: DATACENTER_RESULTS.md.

The equal-precision gap (stated plainly)

On the like-for-like bf16 path AMK trails cuBLAS. On a 622.9 MB model the bf16 kernel runs at ~1.38 ms/token, ~1.24× slower than CUDA-graphed cuBLAS; the optimized bf16 GEMV sustains ~451 GB/s = ~51% of spec / ~63% of measured HBM peak (cuBLAS ceiling ~90%). The int8 win therefore comes from streaming fewer bytes, not from a faster kernel; the int8 kernel on the same model is ~1.22 ms/token (~1.09× slower than cuBLAS bf16). The remaining lever is a bandwidth-saturating cp.async GEMV with coarser cross-SM sync, a kernel-quality push rather than a redesign, and the correctness-bearing architecture that makes that safe and automatic is already in place. We do not beat cuBLAS/vLLM at bf16 batch-1, and we say so. Reproduce (GPU): uv run pytest tests/test_cuda_perf.py; the 10-minute self-improving run: uv run python amk_cli.py autoresearch small --gpu rtx5090 --minutes 10.


What’s built and measured

  • The whole forward pass runs as one cooperative kernel launch. The persistent megakernel (one threadblock per SM, counter-synchronized) executes a full Llama-style decode and matches eager PyTorch and the CPU reference VM to ~1e-7 (fp32) / bf16 tolerance.
  • Self-retargeting, measured on three architectures. The same source built and ran a correct megakernel on sm_120 (RTX 5090), sm_80 (A100), and sm_90 (H100), with the nvcc gencode derived from the live device (the H100 ran a 3 GB / 3202-task Llama-1B-shaped decode correctly). See DATACENTER_RESULTS.md.
  • Multi-token generation that matches eager. AMK greedily decodes a sequence, threading a persistent KV cache across steps; the generated token ids are identical to eager greedy decode. Reproduce: uv run amk generate toy --gpu rtx5090 --prompt-ids "1,2,3" --max-tokens 32 --verify.
  • A real trained checkpoint, end-to-end. AMK imports HuggingFaceTB/SmolLM2-135M (real weights + tokenizer) and reproduces HuggingFace’s own greedy generate token-for-token. Reproduce: uv run python examples/run_hf_model.py (also tests/test_hf_checkpoint.py).
  • A statically-checked schedule validator. Across 7,160 adversarial schedules the validator had zero false-accepts; an unsafe agent-proposed schedule is REJECTED at validation time instead of hanging the GPU.
  • A native coding-agent harness. Drivable by any coding agent (Claude Code / Codex) through one structured edit surface: MCP server + Claude Code skill/commands/subagent/workflow + Codex AGENTS.md. See docs/AGENT_HARNESS.md and HARNESS.md.
  • A 10-minute unattended autoresearch run self-improves the megakernel 1.47× over its own starting schedule.
  • 98 tests green (uv run pytest): 78 pass on CPU; 20 CUDA tests auto-skip without a GPU.

Why a megakernel

Single-stream decode is bandwidth-bound: each token must stream the whole weight set through the SMs once. The theoretical floor is weights_bytes / HBM_bandwidth. A normal PyTorch / cuBLAS execution launches one kernel per op and round-trips activations through HBM between every op, paying launch latency and a memory bubble dozens of times per layer.

A megakernel launches once, keeps the persistent threadblocks resident on every SM, and walks the model’s dependency graph in-place: activations live in on-chip pages, the next layer’s weights are prefetched while the current layer computes, and there is no kernel-launch bubble between ops. The win regime is single-stream / low-batch decode latency: voice, realtime, agentic loops. We do not claim to beat throughput-optimized serving at high batch; that is compute-bound and not this fight.

How it works

Generation is confined inside a verified structure, so correctness is a property of the architecture, not of the model output.

Four layers

LayerRoleTrust model
0: VM (vm/)Persistent kernel: per-SM scheduler loop, page-based scratchpad, counter-based sync. Launched once, runs the whole forward pass.Trusted base. Hand-written, exhaustively verified, frozen per arch.
1: Instructions (instructions/)ABI-conformant micro-kernels (gemv/gemm tile, attention tile, RMSNorm, RoPE, SwiGLU, dequant…). Triton for iteration, CUDA for max perf.Each is correctness-checked vs its reference op in isolation before it can enter a megakernel.
2: Scheduler (schedule/)HF model → graph IR → tiled task-DAG → instruction stream + page allocation. Cost-model explore + on-hardware exploit.The research core. Proposes points in a search space the VM realizes safely.
3: Dynamism (dynamism/)Shape-parametric tiles + in-kernel dispatch: continuous batching, dynamic shapes, MoE routing.The relevance gate for real serving. (Roadmap: placeholder package.)

Deadlock-freedom by construction

A forward pass is a DAG. Producers only increment counters; consumers only wait on statically-known thresholds; execution is a topological walk with monotonic counters: no locks, no arbitrary signalling. The VM refuses to load any schedule that is not a valid DAG: an invalid schedule becomes a clean REJECTED at validation time instead of a hung GPU. This is what makes auto-generated schedules safe to run unattended.

Two autoresearch loops

  • Loop 1, Instruction optimization (this is AutoKernel): edit one ABI-conformant micro-kernel, isolated correctness-then-latency eval (~seconds), keep/revert. A wrong instruction fails its own unit test; no persistent kernel, no hang.
  • Loop 2, Schedule optimization (the new loop): the agent’s edit surface is the schedule IR, a structured object {tiling, fusion_grouping, sm_assignment, pipelining_depth, page_allocation} plus kernel knobs, not megakernel code. The frozen VM deterministically lowers it; every proposal is statically DAG-validated before launch. The full contract is in HARNESS.md.

The four properties (the product spec)

  1. Generality: one command compiles a model into a verified megakernel with zero per-model hand-written CUDA (today: the HF Llama family; broadening coverage is the roadmap).
  2. Self-retargeting: when new silicon ships, AMK retargets in days via search + on-hardware verification. Already measured across sm_120 / sm_80 / sm_90. This is the moat.
  3. A standard IR: AMK owns the canonical megakernel IR: the SM-level task-DAG, the instruction ABI, the schedule format. See docs/IR_SPEC.md, schedule/ir.py, vm/abi.h.
  4. A data flywheel: every run logs (model, gpu, schedule, instruction, measured result); that corpus trains a learned prior so every future run starts smarter.

Native coding-agent integration

AMK exposes the verified loop substrate natively to coding agents, with the same behavior and the same honesty rules. The single guide is docs/AGENT_HARNESS.md.

  • MCP server (amk_mcp.py): amk_doctor / amk_propose / amk_eval / amk_loop / amk_autoresearch / amk_orchestrate_*. Enable with uv sync --extra agent; register via .mcp.json (Claude Code) or ~/.codex/config.toml (Codex).
  • Claude Code: the megakernel-optimization skill; the /amk-optimize, /amk-autoresearch, /amk-compile slash commands; the amk-megakernel-optimizer subagent; a workflow and a goal under .claude/.
  • Codex: AGENTS.md + the same MCP server.

Quickstart

AMK installs as a real package (hatchling) and exposes a real amk console command. uv sync provisions the full environment (torch cu128, numpy, transformers for the HF importer, ninja for the CUDA JIT build, pytest); uv is the recommended path. With pip: pip install "automegakernel[models,cuda]", note the cu128 torch pin only applies under uv, so pip users on Blackwell/sm_120 (or any specific CUDA build) should install the matching torch wheel first, e.g. pip install torch --index-url https://download.pytorch.org/whl/cu128.

uv sync                              # provision the env + install the `amk` command (editable)

# --- No GPU required (works on a fresh CPU-only machine) ---
uv run pytest                        # full suite (98 tests; 78 on CPU, 20 CUDA auto-skip without a GPU)
amk doctor                           # environment + GPU + nvcc + registered targets
amk eval toy --device cpu            # one structured correctness verdict on the CPU reference VM

# --- Requires a CUDA GPU + nvcc ---
amk compile toy --gpu rtx5090 --regime single-stream                            # model -> verified megakernel + report
amk generate toy --gpu rtx5090 --prompt-ids "1,2,3" --max-tokens 32 --verify    # multi-token decode == eager
amk eval toy --gpu rtx5090                                                       # correctness + measured-GPU latency

Every subcommand is also runnable as uv run python amk_cli.py <cmd> ... (identical behavior). A real trained checkpoint runs end-to-end via uv run python examples/run_hf_model.py.

Honesty rules (enforced by the harness, not just stated)

  • Never a latency number without its paired correctness result. eval/bench.py refuses to emit a latency without a verdict from eval/oracle.py.
  • Correctness = full-model logit equivalence within tolerance plus generated-token agreement vs eager PyTorch over a sequence.
  • Always report distance to the weights / HBM_bandwidth roofline (eval/roofline.py).
  • Measured numbers are produced on the hardware named in the flywheel corpus / results.tsv; the datacenter numbers in DATACENTER_RESULTS.md are measured. We do not transcribe numbers we did not measure.
  • We are near-bandwidth-bound nowhere yet on the bf16 path, and we say so. The honest current claims are: the int8 inference-fleet cuBLAS win, generality, self-retargeting, trust (deadlock + race free by construction), and honest distance-to-roofline.

Repo layout

vm/            Layer 0, trusted megakernel VM (CUDA) + CPU reference simulator + verify
instructions/  Layer 1, ABI-conformant micro-kernels (Triton + CUDA) + generator + verify
schedule/      Layer 2, graph import, lowering, the STANDARD IR, cost model, search
dynamism/      Layer 3, continuous batching, dynamic shapes, MoE (roadmap placeholder)
eval/          oracle (logit equivalence) · bench (latency) · baselines · roofline
amk_cli.py     the `amk` console command (doctor/compile/generate/eval/propose/loop/...)
compile.py     THE PRODUCT: amk compile <hf-model> --gpu <arch>
generate.py    autoregressive multi-token decode (KV cache threaded across steps)
harness.py     the coding-agent integration surface (Loop 2, schedule search)
examples/      run_hf_model.py, a real HF checkpoint end-to-end (SmolLM2-135M)
docs/          IR_SPEC.md (the standard IR) · AGENT_HARNESS.md (agent integration)
HARNESS.md     the coding-agent harness contract
DATACENTER_RESULTS.md  measured inference-fleet + sm_80/sm_90 self-retargeting results
program.md     the autonomous-operation brain (run AMK unattended)
models/        self-contained test models (small dense -> MoE)

Status

The correctness-bearing core, the GPU megakernel, self-retargeting, multi-token generation, the real-checkpoint path, and the agent harness are all built and measured. The active push is kernel-quality perf toward the roofline (see the gap). See program.md for the roadmap and the autonomous-loop discipline; the IR (schedule/ir.py, docs/IR_SPEC.md) and the instruction ABI (vm/abi.h) are the two stable contracts everything is built against.

License

MIT © 2026 RightNow AI

Enterprise

AutoMegaKernel is our open research harness. For production and enterprise needs we are building Forge, an internal advanced kernel generator for enterprises. For enterprise requests, contact [email protected].

Similar Articles

Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference

arXiv cs.CL

This paper introduces Ada-MK, an adaptive MegaKernel optimization method that uses automated DAG-based search to eliminate runtime branching and reduce shared memory usage for LLM inference. It demonstrates significant throughput improvements on NVIDIA Ada GPUs by integrating with TensorRT-LLM, achieving up to 23.6% faster performance than vanilla TensorRT-LLM in commercial advertising systems.