@no_stp_on_snek: My second and late Build Small submission. 10 days, 1 dev: a from-scratch Rust engine + custom GPU kernels vs vLLM on N…

X AI KOLs Following 06/15/26, 11:47 PM Tools

rust gpu-kernels inference-engine benchmark open-source build-small hackathon

Summary

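Over 10 days, a solo developer built a from-scratch Rust inference engine with custom GPU kernels and benchmarked it against vLLM on NVIDIA's GB10 running Nemotron-30B. Decode is faster than vLLM at every depth (75.7 vs 57 tok/s), while prefill comes close but remains behind. The project was submitted to the Build Small hackathon.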

My second and late Build Small submission. 10 days, 1 dev: a from-scratch Rust engine + custom GPU kernels vs vLLM on NVIDIA's GB10, on NVIDIA's own Nemotron-30B. Decode beats vLLM at every depth (75.7 vs 57 tok/s). Prefill close but no cigar. Not bad for the timeline. Time for a computer break https://huggingface.co/spaces/build-small-hackathon/ffai-vs-vllm-gb10… @huggingface @Gradio @nvidia #BuildSmall
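To put the quoted decode figures in perspective, here is a minimal sketch of the arithmetic. It uses only the two numbers from the post (75.7 and 57 tok/s). The helper names `speedup` and `ms_per_token` are hypothetical and are not taken from the FFAI code base.

```rust
/// Ratio of two decode throughputs, both in tokens per second.
/// Hypothetical helper for illustration only.
fn speedup(candidate_tok_s: f64, baseline_tok_s: f64) -> f64 {
    candidate_tok_s / baseline_tok_s
}

/// Average time spent per generated token, in milliseconds.
/// Hypothetical helper for illustration only.
fn ms_per_token(tok_s: f64) -> f64 {
    1000.0 / tok_s
}

fn main() {
    // Decode throughputs quoted in the post (tok/s).
    let (ffai, vllm) = (75.7, 57.0);

    println!("speedup: {:.2}x", speedup(ffai, vllm)); // speedup: 1.33x
    println!("FFAI: {:.2} ms/token", ms_per_token(ffai)); // FFAI: 13.21 ms/token
    println!("vLLM: {:.2} ms/token", ms_per_token(vllm)); // vLLM: 17.54 ms/token
}
```

On these numbers the decode advantage is roughly 33%, or about 4.3 ms saved per generated token. The post gives no prefill figures, so the same comparison cannot be made for prefill.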


Cached at: 06/16/26, 01:09 AM


FFAI vs vLLM on GB10 - a Hugging Face Space by build-small-hackathon

Source: https://huggingface.co/spaces/build-small-hackathon/ffai-vs-vllm-gb10

Similar Articles

@no_stp_on_snek: https://x.com/no_stp_on_snek/status/2052833502475833384

X AI KOLs Following

An open-source stack using Qwen2.5-32B-Instruct with longctx and vllm-turboquant on a single AMD MI300X achieves competitive results (0.601-0.688) versus SubQ's closed model (0.659) on the MRCR v2 1M-context benchmark, demonstrating open-weights approaches are within striking distance.

@charles_irl: Somehow missed this one in the hustle and bustle. Very cool demo!

X AI KOLs Following

A developer built a 12M parameter LLM using a custom ML framework with a Rust backend and CUDA kernels, including Flash Attention and AdamW, and trained it from scratch.

@binsquares: omg, GPU acceleration on smolvm works way better than I thought. can run llama.cpp inside the smol machine with close t…

X AI KOLs Following

User @binsquares reports that GPU acceleration on smolvm achieves nearly 90% of host performance when running llama.cpp via the Vulkan backend.

I put together a Rust-native, CPU-only implementation of LFM2.5-8B-A1B

Reddit r/LocalLLaMA

The author released a pure Rust, CPU-only inference implementation of the LFM2.5-8B-A1B model (4-bit Q4KM quantization), achieving a decode speed of approximately 37 tokens/s and memory usage around 7GB. The goal is to make LLMs runnable on cheap VPS or older machines. The implementation is open source and published as a cargo crate.

Wrote a custom C++ engine for MiniCPM-V 4.6 on Orange Pi AIPro (Ascend 310B) to bypass framework overhead

Reddit r/LocalLLaMA

The author developed a custom C++ inference engine for MiniCPM-V 4.6 on the Orange Pi AIPro (Ascend 310B NPU), achieving a 2x speedup over the stock framework by writing optimized AscendC kernels for matmul and causal-conv1d, and reaching 5.90 tokens/s.
