@_akhaliq: GPU Forecasters Language Models as Selective Surrogates for Kernel Runtime Optimization

X AI KOLs Following 06/02/26, 03:53 PM Papers

gpu language-models kernel-optimization runtime-optimization forecasting surrogates

Summary

This paper proposes using language models as selective surrogates to optimize GPU kernel runtime, demonstrating a novel approach to performance forecasting.

GPU Forecasters Language Models as Selective Surrogates for Kernel Runtime Optimization https://t.co/s2r0lFWz9r

Original Article

View Cached Full Text

Cached at: 06/02/26, 07:38 PM

GPU Forecasters

Language Models as Selective Surrogates for Kernel Runtime Optimization https://t.co/s2r0lFWz9r

Similar Articles

A hackable compiler to generate efficient fused GPU kernels for AI models [P]

Reddit r/MachineLearning

The author presents a custom, hackable ML compiler written in Python that lowers LLMs to optimized CUDA kernels through a multi-stage IR pipeline, achieving performance competitive with or superior to PyTorch on specific operations. The article details the compiler's optimization passes, lowering rules, and CLI usage for generating efficient fused GPU kernels.

@vivekgalatage: Best structured reference I've found for GPU optimization - 450 papers, 14 years of research. Some techniques will have…

X AI KOLs Timeline

A tweet shares a structured reference of 450 papers on GPU optimization spanning 14 years, noting that while some techniques evolve, the mental models remain useful. It also references a lecture on GPU architectures by Onur Mutlu.

AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation

arXiv cs.CL

Researchers from Carnegie Mellon, University of Washington, and Arm propose AdaExplore, an LLM agent framework for GPU kernel code generation that achieves 3.12× and 1.72× speedups on KernelBench Level-2 and Level-3 benchmarks through failure-driven adaptation and diversity-preserving search, without additional fine-tuning.

Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

arXiv cs.AI

This paper presents an empirical study on scheduling multiple LLMs on shared heterogeneous hardware, focusing on performance implications of CPU-GPU offloading and preemption. It finds that offloading causes non-linear decode degradation, especially for smaller models, and preemption overhead is dominated by model state reload, providing design guidance for future multi-model schedulers.

Extensions and limitations of the neural GPU

OpenAI Blog

This paper explores extensions and limitations of the Neural GPU model, demonstrating improvements through curriculum design and scaling, enabling it to learn arithmetic operations on decimal numbers and long expressions while identifying failure modes on symmetric inputs analogous to adversarial examples.