@leloykun: I lost track of time again >.< I'm really sorry if you've DMed me recently. I promise to go through my DMs! --- This sprint, I …
Summary
The author developed a Lean4-to-TileLang tensor program superoptimizer that automatically generates optimized accelerator kernels and derives hyperparameter scaling laws, achieving a 1.8x speedup on A100 GPUs.
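To illustrate the core idea behind superoptimization (independent of the author's Lean4-to-TileLang system, which is far more sophisticated), here is a minimal sketch: exhaustively enumerate short straight-line programs over a tiny op set and keep the cheapest one that matches a reference function on sample inputs. All names (`OPS`, `COST`, `superoptimize`) are illustrative, not the project's API.

```python
import itertools

# Toy enumerative superoptimizer: search for the cheapest straight-line
# program over a tiny op set that agrees with a reference function on
# sample inputs. Illustrative only.
OPS = {
    "add":  lambda a, b: a + b,
    "mul":  lambda a, b: a * b,
    "shl1": lambda a, b: a * 2,  # shift-left by one; ignores b
}
COST = {"add": 1, "mul": 3, "shl1": 1}

def run(program, x):
    """Execute a list of (op, i, j) triples; env[0] is the input."""
    env = [x]
    for op, i, j in program:
        env.append(OPS[op](env[i], env[j]))
    return env[-1]

def superoptimize(reference, samples, max_len=2):
    best, best_cost = None, float("inf")
    for length in range(1, max_len + 1):
        for ops in itertools.product(OPS, repeat=length):
            # Each instruction may read the input or any earlier result.
            arg_choices = [range(k + 1) for k in range(length)]
            for args in itertools.product(
                *[itertools.product(c, c) for c in arg_choices]
            ):
                prog = [(op, i, j) for op, (i, j) in zip(ops, args)]
                cost = sum(COST[op] for op in ops)
                if cost < best_cost and all(
                    run(prog, x) == reference(x) for x in samples
                ):
                    best, best_cost = prog, cost
    return best, best_cost

# Example: x -> 4*x costs 3 if done with a multiply, but the search
# finds a cost-2 equivalent built from two doublings/additions.
prog, cost = superoptimize(lambda x: 4 * x, samples=range(8))
```

A real tensor-program superoptimizer replaces the brute-force enumeration with guided search over kernel schedules and verifies equivalence formally rather than on samples, but the search-for-a-cheaper-equivalent-program loop is the same shape.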
Similar Articles
@leloykun: [WIP] Blog post on Lean4-to-TileLang Tensor Program Superoptimizer here:
A technical blog post introduces a Lean4-to-TileLang tensor program superoptimizer that automatically generates optimized GPU/TPU kernels and hyperparameter scaling laws, demonstrating performance gains over torch.compile.
AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
AccelOpt is a self-improving LLM agentic system that autonomously optimizes AI accelerator kernels through iterative generation and optimization memory, achieving 49-61% peak throughput improvements on AWS Trainium while being 26x cheaper than Claude Sonnet 4.
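The "iterative generation plus optimization memory" loop described above can be sketched generically. This is not AccelOpt's code; a random mutator stands in for the LLM proposer, and a toy objective over a `tile_size` parameter stands in for a real Trainium benchmark.

```python
import random

def throughput(tile_size):
    # Toy objective peaking at tile_size = 64 (stand-in for a benchmark).
    return -abs(tile_size - 64)

def propose(memory, rng):
    # The proposer conditions on the best past attempt (the "optimization
    # memory"); an LLM would see past attempts as few-shot examples.
    base = max(memory, key=throughput)
    return max(1, base + rng.choice([-16, -8, -4, 4, 8, 16]))

def optimize(iterations=50, seed=0):
    rng = random.Random(seed)
    memory = [8]  # initial naive kernel configuration
    for _ in range(iterations):
        candidate = propose(memory, rng)
        memory.append(candidate)  # record every attempt and its score
    return max(memory, key=throughput)

best = optimize()
```

The self-improving aspect comes from the memory: each round's proposals are conditioned on what has already worked, so the system climbs toward better configurations without human-written heuristics.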
@leopardracer: https://x.com/leopardracer/status/2055341758523883631
A user shares their experience setting up a dual-GPU local AI lab with RTX 4080 Super and 5060 Ti, running Qwen 3.6 models via llama.cpp and llama-swap to reduce API costs and enable unrestricted experimentation.
@pupposandro: 2.5x faster than llama.cpp on Strix Halo. We just shipped DFlash + PFlash for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, …
A new toolset (DFlash + PFlash) achieves 2.5x faster inference than llama.cpp on AMD Ryzen AI MAX+ 395 iGPU, demonstrating significant speedups for Qwen3.6-27B with 128 GiB unified memory.
A hackable compiler to generate efficient fused GPU kernels for AI models [P]
The author presents a custom, hackable ML compiler written in Python that lowers LLMs to optimized CUDA kernels through a multi-stage IR pipeline, achieving performance competitive with or superior to PyTorch on specific operations. The article details the compiler's optimization passes, lowering rules, and CLI usage for generating efficient fused GPU kernels.
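As a hedged, minimal illustration of the fusion idea behind such a compiler (not the project's actual IR or API): a tiny pipeline of elementwise ops is composed into one expression, then "lowered" to a single CUDA-style kernel string, so the data makes one global-memory round trip instead of one per op.

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str   # e.g. "scale", "relu"
    expr: str   # C expression template; {x} is the incoming value

def fuse(ops):
    """Compose elementwise ops into a single nested C expression."""
    expr = "x[i]"
    for op in ops:
        expr = op.expr.format(x=expr)
    return expr

def lower(ops, kernel_name="fused_kernel"):
    """Emit a CUDA-style kernel computing the whole fused pipeline."""
    body = fuse(ops)
    return (
        f"__global__ void {kernel_name}(const float* x, float* y, int n) {{\n"
        f"  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        f"  if (i < n) y[i] = {body};\n"
        f"}}\n"
    )

# Example: fuse scale -> bias -> ReLU into one kernel launch.
pipeline = [
    Op("scale", "({x} * 2.0f)"),
    Op("bias",  "({x} + 1.0f)"),
    Op("relu",  "fmaxf({x}, 0.0f)"),
]
cuda_src = lower(pipeline)
```

A real compiler of this kind adds more IR stages (graph-level rewrites, tiling, memory planning) between the op graph and the emitted CUDA, but elementwise fusion into one kernel body is the canonical first optimization pass.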