@PyTorch: One runtime, multiple GPU architectures, and zero vendor-specific model code. In this blog post, the TokenSpeed team @l…


Summary

TokenSpeed-Kernel is a portable, high-performance kernel system for LLM inference. It supports multiple GPU architectures without any vendor-specific model code and achieves up to 3.6x higher throughput on the AMD MI355X.

One runtime, multiple GPU architectures, and zero vendor-specific model code.

In this blog post, the TokenSpeed team @lightseekorg introduces TokenSpeed-Kernel, a portable, high-performance kernel system built for modern LLM inference. Using GPT-OSS 120B as a case study, they show how specialized kernels for @AIatAMD and @NVIDIAAI GPUs can seamlessly coexist behind a common API. This unified approach delivers up to 3.6x higher throughput on the AMD MI355X, all without requiring any changes to the underlying model logic.

Link to blog in comments section.
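
To make the "specialized kernels behind a common API" idea concrete, here is a minimal sketch of a vendor-keyed kernel registry in plain Python. It is an illustration only and not TokenSpeed-Kernel's actual API: the names `register_kernel`, `dispatch`, `model_layer`, the `"scale"` op, and the vendor strings are all hypothetical, and the "kernels" are ordinary Python functions standing in for real GPU code.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry mapping (op name, vendor) -> implementation.
# Nothing here is taken from TokenSpeed-Kernel; it only illustrates the pattern.
_KERNELS: Dict[Tuple[str, str], Callable[..., List[float]]] = {}


def register_kernel(op: str, vendor: str):
    """Decorator that files an implementation under (op, vendor)."""
    def decorator(fn):
        _KERNELS[(op, vendor)] = fn
        return fn
    return decorator


def dispatch(op: str, vendor: str, *args):
    """Common entry point: look up the vendor-specific kernel and call it."""
    try:
        kernel = _KERNELS[(op, vendor)]
    except KeyError:
        raise NotImplementedError(f"no kernel for op={op!r} on vendor={vendor!r}")
    return kernel(*args)


@register_kernel("scale", "amd")
def _scale_amd(xs: List[float], factor: float) -> List[float]:
    # Stand-in for an AMD-tuned kernel.
    return [x * factor for x in xs]


@register_kernel("scale", "nvidia")
def _scale_nvidia(xs: List[float], factor: float) -> List[float]:
    # Stand-in for an NVIDIA-tuned kernel; same contract, different body.
    return list(map(lambda x: x * factor, xs))


def model_layer(xs: List[float], vendor: str) -> List[float]:
    """Model code names only the op; it never branches on the vendor itself."""
    return dispatch("scale", vendor, xs, 2.0)
```

In this sketch the model code stays identical across vendors, and adding support for a new architecture means registering another implementation rather than editing `model_layer`.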

Cached at: 06/25/26, 07:25 PM

