@no_stp_on_snek: My second and late Build Small submission. 10 days, 1 dev: a from-scratch Rust engine + custom GPU kernels vs vLLM on N…

X AI KOLs Following Tools

Summary

A solo developer spent 10 days building a from-scratch Rust inference engine with custom GPU kernels and submitted it to the Build Small hackathon. On NVIDIA's GB10 running Nemotron-30B, its decode throughput beats vLLM (75.7 vs 57 tok/s), while prefill comes close but does not match it.
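To put the two decode figures in perspective, here is a minimal Rust sketch that turns them into a relative speedup and a per-token latency. The function names are hypothetical, and it assumes both throughput numbers were measured under the same conditions, which the post does not spell out.

```rust
/// Relative speedup of `ours` over `baseline`, both in tokens per second.
fn speedup(ours: f64, baseline: f64) -> f64 {
    ours / baseline
}

/// Mean time per generated token, in milliseconds, for a given throughput.
fn ms_per_token(tok_per_s: f64) -> f64 {
    1000.0 / tok_per_s
}

fn main() {
    // Decode throughput figures quoted in the post (tok/s).
    let (engine, vllm) = (75.7_f64, 57.0_f64);

    println!("speedup: {:.2}x", speedup(engine, vllm));
    println!("engine: {:.1} ms/token", ms_per_token(engine));
    println!("vLLM:   {:.1} ms/token", ms_per_token(vllm));
}
```

By this arithmetic the reported gap is roughly a one-third improvement in decode throughput, or about 4 ms saved per generated token.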

My second and late Build Small submission. 10 days, 1 dev: a from-scratch Rust engine + custom GPU kernels vs vLLM on NVIDIA's GB10, on NVIDIA's own Nemotron-30B. Decode beats vLLM at every depth (75.7 vs 57 tok/s). Prefill close but no cigar. Not bad for the timeline. Time for a computer break https://huggingface.co/spaces/build-small-hackathon/ffai-vs-vllm-gb10… @huggingface @Gradio @nvidia #BuildSmall

Cached at: 06/16/26, 01:09 AM


FFAI vs vLLM on GB10 - a Hugging Face Space by build-small-hackathon

Source: https://huggingface.co/spaces/build-small-hackathon/ffai-vs-vllm-gb10
