Author describes building FlashRT, a CUDA-first inference runtime that rewrites model inference paths with C++/CUDA kernels to address bottlenecks beyond GEMM for small-batch/realtime workloads, achieving significant latency improvements on Jetson Thor and RTX 5090. The article discusses lessons on precision (FP8 helpful, FP4 mixed) and the need to bypass generic runtimes for realtime inference.
I’ve been working on a CUDA-first inference runtime for small-batch / realtime ML workloads.

The core idea is simple: instead of treating PyTorch / TensorRT / generic graph runtimes as the main execution path, I rewrite the model inference path directly with C++/CUDA kernels.

This started from robotics / VLA workloads, but the problem is more general. In small-batch inference, the bottleneck is often not just a single slow GEMM. A lot of latency comes from the runtime glue around the math:

* fragmented small kernels
* norm / residual / activation boundaries
* quantize / dequantize overhead
* layout transitions
* Python / runtime scheduling
* graph compiler fusion failures
* precision conversion around FP8 / FP4 regions

For cloud LLM serving, batching can hide a lot of this. For robotics, VLA, world models, and other realtime workloads, batch size is usually 1. There is nowhere to hide. Every launch, sync, and format boundary shows up directly in latency.

Some current results from my implementation:

|Model / workload|Hardware|FlashRT latency|
|:-|:-|:-|
|Pi0.5|Jetson Thor|\~44 ms|
|Pi0|Jetson Thor|\~46 ms|
|GROOT N1.6|Jetson Thor|\~41–45 ms|
|Pi0.5|RTX 5090|\~17.6 ms|
|GROOT N1.6|RTX 5090|\~12.5–13.1 ms|
|Pi0-FAST|RTX 5090|\~2.39 ms/token|
|Qwen3.6 27B|RTX 5090|\~129 tok/s with NVFP4|
|Motus / Wan-style world model|RTX 5090|\~1.3s baseline → targeting \~100ms E2E|

The Motus / world-model case is especially interesting. The baseline path is around 1.3s end-to-end. The target is \~100ms E2E, but the hard part is not simply “use a faster GEMM”. The bottlenecks are VAE, joint attention, launch fragmentation, and a large amount of glue around the actual math.

One lesson from this work: lower precision is not automatically a win. FP8 has been consistently useful. FP4 / NVFP4 is more mixed.
It can help memory footprint and some large GEMM regions, but if the FP4 region is small, discontinuous, or surrounded by conversion / scaling overhead, the end-to-end speedup can be tiny. For example, in some VLA / world-model paths, FP4 over FP8 only gives a few percent latency improvement unless the region is large and deeply fused.

This changed how I think about inference optimization. For large-batch cloud serving, generic runtimes and batching are often enough. For realtime small-batch inference, the runtime overhead becomes the workload.

Curious if others have seen similar behavior with torch.compile, TensorRT, XLA, Triton, or custom CUDA kernels. At what point do you stop trying to make a generic compiler optimize the model, and just rewrite the inference path directly?

Implementation: [https://github.com/LiangSu8899/FlashRT](https://github.com/LiangSu8899/FlashRT)
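
One rough way to see why FP4 over FP8 can end up worth only a few percent is an Amdahl's-law estimate. The sketch below is not taken from FlashRT and is not a measurement: the function name `end_to_end_speedup`, its parameters, and every number in the scenarios are hypothetical values chosen only to illustrate the arithmetic.

```python
def end_to_end_speedup(fp4_fraction, region_speedup, conversion_overhead=0.0):
    """Amdahl-style estimate of end-to-end speedup from moving a region to FP4.

    All latencies are normalized so the FP8 baseline equals 1.0.

    fp4_fraction        -- share of baseline latency spent in the region that
                           gets converted to FP4 (between 0 and 1)
    region_speedup      -- how much faster that region runs in FP4 (> 0)
    conversion_overhead -- extra quantize / dequantize / scaling cost added at
                           the region boundaries, as a share of baseline latency
    """
    if not 0.0 <= fp4_fraction <= 1.0:
        raise ValueError("fp4_fraction must be between 0 and 1")
    if region_speedup <= 0.0:
        raise ValueError("region_speedup must be positive")
    if conversion_overhead < 0.0:
        raise ValueError("conversion_overhead must be non-negative")

    untouched = 1.0 - fp4_fraction              # work that stays in FP8
    accelerated = fp4_fraction / region_speedup  # work that moved to FP4
    new_latency = untouched + accelerated + conversion_overhead
    return 1.0 / new_latency


# Hypothetical scenarios (illustrative numbers only, not benchmark data).
SCENARIOS = {
    "small, fragmented FP4 region": (0.15, 1.6, 0.02),
    "large, deeply fused FP4 region": (0.70, 1.6, 0.01),
    "tiny region, heavy conversion": (0.05, 1.6, 0.03),
}

if __name__ == "__main__":
    for name, (fraction, speedup, overhead) in SCENARIOS.items():
        total = end_to_end_speedup(fraction, speedup, overhead)
        print(f"{name}: {total:.3f}x end-to-end")
```

With these made-up inputs, the small fragmented region yields roughly a 4% end-to-end gain, the large fused region roughly 34%, and the tiny region with heavy conversion cost comes out slower than the FP8 baseline. The same per-region speedup gives very different results depending on how much of the latency the region covers and how much conversion glue surrounds it.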
NVIDIA and Google collaborate to optimize Gemma 4 models for local deployment across RTX GPUs, DGX Spark, and Jetson devices, enabling efficient on-device agentic AI with support for reasoning, coding, multimodal capabilities, and 35+ languages.
A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.
The author presents a custom, hackable ML compiler written in Python that lowers LLMs to optimized CUDA kernels through a multi-stage IR pipeline, achieving performance competitive with or superior to PyTorch on specific operations. The article details the compiler's optimization passes, lowering rules, and CLI usage for generating efficient fused GPU kernels.
Lucebox Hub provides optimized CUDA kernels (Megakernel, DFlash, PFlash) for local LLM inference, achieving significant speedups (2-10x) over llama.cpp on various models and GPUs.
This paper presents a systematic optimization study of real-time diffusion model inference on the Apple M3 Ultra, achieving 22.7 FPS at 512x512 resolution using CoreML conversion and a distillation model, revealing that CUDA-optimized techniques do not directly transfer to Apple's unified memory architecture.