
Summary

The author developed a Lean4-to-TileLang tensor program superoptimizer that automatically generates optimized accelerator kernels and derives hyperparameter scaling laws, achieving a ~1.8x geomean speedup on an A100 benchmark set.

@leloykun: I lost track of time again >.< I'm really sorry if you DMed me lately. I promise to go over my DMs!

This sprint, I built a Lean4-to-TileLang Tensor Program Superoptimizer. With this, I now have a formal infrastructure where I (or my agents) can define neural network architectures in Lean4 and automatically get:

1. Optimized IO-aware accelerator kernels in TileLang. It can find FlashAttention2, FlashNorm, split-k matmul, and others automatically. I'm currently getting a ~1.8x geomean speedup on my benchmark set on A100s.

2. Optimizer choices and parametrizations that enable hyperparameter transfer across width and depth (see my previous blog posts).

3. Hyperparameter scaling laws that tell us how to adjust hyperparameters as we scale batch size, training horizon, dataset size, etc. (see quoted tweet; a rough illustrative sketch follows this list).

4. Low-rank proxies for the optimizers to speed up hyperparameter tuning at small scales and have them transfer to the full-rank case (we have an upcoming paper on this, stay tuned!).
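To make items 2 and 3 concrete, here is a minimal Python sketch of what a hyperparameter-transfer rule can look like. The muP-style 1/width rule and the square-root batch-size correction are illustrative assumptions, not the author's actual derived scaling laws:

```python
import math

def scaled_lr(base_lr: float, base_width: int, width: int,
              base_batch: int, batch: int) -> float:
    """Illustrative hyperparameter-transfer rule.

    Assumes a muP-style parametrization where the learning rate of
    hidden (matrix-like) parameters shrinks as 1/width, combined with
    a square-root batch-size correction. Both rules are stand-ins for
    the author's scaling laws, not their actual formulas.
    """
    width_factor = base_width / width             # muP: hidden lr scales as 1/width
    batch_factor = math.sqrt(batch / base_batch)  # sqrt batch scaling (assumed)
    return base_lr * width_factor * batch_factor

# Tune at a small proxy scale, then transfer to the target scale:
lr_proxy = 3e-4  # found by sweeping at width=256, batch=64
lr_target = scaled_lr(lr_proxy, base_width=256, width=4096,
                      base_batch=64, batch=1024)
print(f"transferred lr: {lr_target:.2e}")
```

The point of such rules is exactly what items 2-4 describe: sweep hyperparameters cheaply at a small proxy scale, then map the optimum to the full-scale run instead of re-tuning there.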

Similar Articles

A hackable compiler to generate efficient fused GPU kernels for AI models [P]

Reddit r/MachineLearning

The author presents a custom, hackable ML compiler written in Python that lowers LLMs to optimized CUDA kernels through a multi-stage IR pipeline, achieving performance competitive with or superior to PyTorch on specific operations. The article details the compiler's optimization passes, lowering rules, and CLI usage for generating efficient fused GPU kernels.
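As background on the pattern such a compiler follows, here is a minimal, self-contained Python sketch of a pass pipeline over a toy IR. All names (`IRNode`, `fuse_elementwise`, `Pipeline`) are hypothetical stand-ins for illustration, not the project's actual API:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class IRNode:
    """Toy IR node: an opcode plus its input nodes."""
    op: str
    inputs: List["IRNode"] = field(default_factory=list)

ELEMENTWISE = {"add", "mul", "relu", "gelu"}

def fuse_elementwise(node: IRNode) -> IRNode:
    """Example optimization pass: fuse unary chains of elementwise ops
    into a single fused op so they can lower to one GPU kernel."""
    node.inputs = [fuse_elementwise(i) for i in node.inputs]
    if (node.op in ELEMENTWISE
            and len(node.inputs) == 1
            and (node.inputs[0].op in ELEMENTWISE
                 or node.inputs[0].op.startswith("fused["))):
        child = node.inputs[0]
        return IRNode(op=f"fused[{child.op};{node.op}]", inputs=child.inputs)
    return node

class Pipeline:
    """Runs a sequence of IR-to-IR passes, like a multi-stage lowering."""
    def __init__(self, passes):
        self.passes = passes

    def run(self, root: IRNode) -> IRNode:
        for p in self.passes:
            root = p(root)
        return root

# matmul -> gelu -> relu: the two elementwise ops get fused into one node.
graph = IRNode("relu", [IRNode("gelu", [IRNode("matmul",
        [IRNode("x"), IRNode("w")])])])
optimized = Pipeline([fuse_elementwise]).run(graph)
print(optimized.op)  # fused[gelu;relu]
```

A real compiler of this kind would chain many such passes (fusion, tiling, memory planning) before emitting CUDA, but the IR-in/IR-out pass interface is the core idea.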