@PyTorch: One runtime, multiple GPU architectures, and zero vendor-specific model code. In this blog post, the TokenSpeed team @l…
Summary
TokenSpeed-Kernel is a portable, high-performance kernel system for LLM inference that supports multiple GPU architectures without vendor-specific model code, achieving up to 3.6x higher throughput on the AMD MI355X.
Cached at: 06/25/26, 07:25 PM
One runtime, multiple GPU architectures, and zero vendor-specific model code.
In this blog post, the TokenSpeed team @lightseekorg introduces TokenSpeed-Kernel, a portable, high-performance kernel system built for modern LLM inference. Using GPT-OSS 120B as a case study, they show how specialized kernels for @AIatAMD and @NVIDIAAI GPUs can seamlessly coexist behind a common API. This unified approach delivers up to 3.6x higher throughput on the AMD MI355X, all without requiring any changes to the underlying model logic.
Link to blog in comments section
Similar Articles
TokenSpeed: A Speed-of-Light LLM Inference Engine for Agentic Workloads (5 minute read)
Lightseek releases TokenSpeed, a high-performance LLM inference engine optimized for agentic workloads, featuring compiler-backed parallelism and advanced kernel optimizations that have been adopted by vLLM.
@PyTorch: PyTorch member Meta just open-sourced a GPU kernel that makes attention 2.3x faster on NVIDIA Blackwell. TLX Block Atte…
Meta open-sources TLX Block Attention, a warp-specialized Triton kernel that achieves 2.3x speedup for block-diagonal self-attention on NVIDIA Blackwell GPUs, with up to 3.5x speedup when fused with rotary embeddings.
@PyTorch: ExecuTorch now has an MLX delegate that runs PyTorch models on Apple Silicon GPUs. It supports LLMs, speech-to-text, an…
ExecuTorch now has an MLX delegate that enables GPU-accelerated inference for PyTorch models on Apple Silicon Macs, supporting LLMs, speech-to-text, and MoE models with quantization via TorchAO.
TorchKM: A GPU-Oriented Library for Kernel Learning and Model Selection
TorchKM is an open-source GPU-accelerated library for kernel machines (SVMs, kernel logistic regression, etc.) with a scikit-learn-style API. It accelerates training and model selection by reusing matrix operations, offering substantial speedups over standard baselines.
Rewriting model inference with CUDA kernels: the bottleneck was not just GEMM [P]
The author describes building FlashRT, a CUDA-first inference runtime that rewrites model inference paths with C++/CUDA kernels to address bottlenecks beyond GEMM in small-batch and realtime workloads, achieving significant latency improvements on Jetson Thor and RTX 5090. The article also covers lessons on precision (FP8 was helpful, FP4 gave mixed results) and the need to bypass generic runtimes for realtime inference.