@_akhaliq: GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization
Summary
This paper proposes using language models as selective surrogates that forecast GPU kernel performance, with the goal of optimizing kernel runtime.
Cached at: 06/02/26, 07:38 PM
GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization https://t.co/s2r0lFWz9r
Similar Articles
A hackable compiler to generate efficient fused GPU kernels for AI models [P]
The author presents a custom, hackable ML compiler written in Python that lowers LLMs to optimized CUDA kernels through a multi-stage IR pipeline, achieving performance competitive with or superior to PyTorch on specific operations. The article details the compiler's optimization passes, lowering rules, and CLI usage for generating efficient fused GPU kernels.
@vivekgalatage: Best structured reference I've found for GPU optimization - 450 papers, 14 years of research. Some techniques will have…
A tweet shares a structured reference of 450 papers on GPU optimization spanning 14 years, noting that while some techniques evolve, the mental models remain useful. It also references a lecture on GPU architectures by Onur Mutlu.
AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
Researchers from Carnegie Mellon, University of Washington, and Arm propose AdaExplore, an LLM agent framework for GPU kernel code generation. Using failure-driven adaptation and diversity-preserving search, and without additional fine-tuning, it achieves speedups of 3.12× and 1.72× on the KernelBench Level-2 and Level-3 benchmarks, respectively.
Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption
This paper presents an empirical study on scheduling multiple LLMs on shared heterogeneous hardware, focusing on performance implications of CPU-GPU offloading and preemption. It finds that offloading causes non-linear decode degradation, especially for smaller models, and preemption overhead is dominated by model state reload, providing design guidance for future multi-model schedulers.
Extensions and limitations of the neural GPU
This paper explores extensions and limitations of the Neural GPU model, demonstrating improvements through curriculum design and scaling, enabling it to learn arithmetic operations on decimal numbers and long expressions while identifying failure modes on symmetric inputs analogous to adversarial examples.