
Summary

The author developed a Lean4-to-TileLang tensor program superoptimizer that automatically generates optimized accelerator kernels and derives hyperparameter scaling laws, achieving a ~1.8x geomean speedup on an A100 benchmark set.

@leloykun: I lost track of time again >.< I'm really sorry if you DMed me lately. I promise to go over my DMs!

This sprint, I built a Lean4-to-TileLang Tensor Program Superoptimizer. With this, I now have a formal infrastructure where I (or my agents) can define neural network architectures in Lean4 and automatically get:

1. Optimized IO-aware accelerator kernels in TileLang. It can find FlashAttention2, FlashNorm, split-k matmul, and others automatically. I'm currently getting a ~1.8x geomean speedup on my benchmark set on A100s.

2. Optimizer choices and parametrizations that enable hyperparameter transfer across width and depth (see my previous blog posts).

3. Hyperparameter scaling laws that tell us how to adjust hyperparameters as we scale batch size, training horizon, dataset size, etc. (see quoted tweet; a rough illustrative sketch follows this list).

4. Low-rank proxies for the optimizers to speed up hyperparameter tuning at small scales and have them transfer to the full-rank case (we have an upcoming paper on this, stay tuned!).
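To make items 2 and 3 concrete, here is a minimal Python sketch of what a hyperparameter-transfer rule can look like. The muP-style 1/width rule and the square-root batch-size correction are illustrative assumptions, not the author's actual derived scaling laws:

```python
import math

def scaled_lr(base_lr: float, base_width: int, width: int,
              base_batch: int, batch: int) -> float:
    """Illustrative hyperparameter-transfer rule.

    Assumes a muP-style parametrization where the learning rate of
    hidden (matrix-like) parameters shrinks as 1/width, combined with
    a square-root batch-size correction. Both rules are stand-ins for
    the author's scaling laws, not their actual formulas.
    """
    width_factor = base_width / width             # muP: hidden lr scales as 1/width
    batch_factor = math.sqrt(batch / base_batch)  # sqrt batch scaling (assumed)
    return base_lr * width_factor * batch_factor

# Tune at a small proxy scale, then transfer to the target scale:
lr_proxy = 3e-4  # found by sweeping at width=256, batch=64
lr_target = scaled_lr(lr_proxy, base_width=256, width=4096,
                      base_batch=64, batch=1024)
print(f"transferred lr: {lr_target:.2e}")
```

The point of such rules is exactly what items 2-4 describe: sweep hyperparameters cheaply at a small proxy scale, then map the optimum to the full-scale run instead of re-tuning there.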

Similar Articles

A hackable compiler to generate efficient fused GPU kernels for AI models [P]

Reddit r/MachineLearning

The author presents a custom, hackable ML compiler written in Python that lowers LLMs to optimized CUDA kernels through a multi-stage IR pipeline, achieving performance competitive with or superior to PyTorch on specific operations. The article details the compiler's optimization passes, lowering rules, and CLI usage for generating efficient fused GPU kernels.
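As background on the pattern such a compiler follows, here is a minimal, self-contained Python sketch of a pass pipeline over a toy IR. All names (`IRNode`, `fuse_elementwise`, `Pipeline`) are hypothetical stand-ins for illustration, not the project's actual API:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class IRNode:
    """Toy IR node: an opcode plus its input nodes."""
    op: str
    inputs: List["IRNode"] = field(default_factory=list)

ELEMENTWISE = {"add", "mul", "relu", "gelu"}

def fuse_elementwise(node: IRNode) -> IRNode:
    """Example optimization pass: fuse unary chains of elementwise ops
    into a single fused op so they can lower to one GPU kernel."""
    node.inputs = [fuse_elementwise(i) for i in node.inputs]
    if (node.op in ELEMENTWISE
            and len(node.inputs) == 1
            and (node.inputs[0].op in ELEMENTWISE
                 or node.inputs[0].op.startswith("fused["))):
        child = node.inputs[0]
        return IRNode(op=f"fused[{child.op};{node.op}]", inputs=child.inputs)
    return node

class Pipeline:
    """Runs a sequence of IR-to-IR passes, like a multi-stage lowering."""
    def __init__(self, passes):
        self.passes = passes

    def run(self, root: IRNode) -> IRNode:
        for p in self.passes:
            root = p(root)
        return root

# matmul -> gelu -> relu: the two elementwise ops get fused into one node.
graph = IRNode("relu", [IRNode("gelu", [IRNode("matmul",
        [IRNode("x"), IRNode("w")])])])
optimized = Pipeline([fuse_elementwise]).run(graph)
print(optimized.op)  # fused[gelu;relu]
```

A real compiler of this kind would chain many such passes (fusion, tiling, memory planning) before emitting CUDA, but the IR-in/IR-out pass interface is the core idea.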