@no_stp_on_snek: My second and late Build Small submission. 10 days, 1 dev: a from-scratch Rust engine + custom GPU kernels vs vLLM on N…
Summary
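A solo developer spent 10 days building a from-scratch Rust inference engine with custom GPU kernels and benchmarked it against vLLM on NVIDIA's GB10 running Nemotron-30B, as a submission to the Build Small hackathon. Decode beats vLLM at every depth (75.7 vs 57 tok/s), while prefill comes close but still trails.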
Cached at: 06/16/26, 01:09 AM
My second and late Build Small submission. 10 days, 1 dev: a from-scratch Rust engine + custom GPU kernels vs vLLM on NVIDIA’s GB10, on NVIDIA’s own Nemotron-30B. Decode beats vLLM at every depth (75.7 vs 57 tok/s). Prefill close but no cigar. Not bad for the timeline.
Time for a computer break
https://huggingface.co/spaces/build-small-hackathon/ffai-vs-vllm-gb10… @huggingface @Gradio @nvidia #BuildSmall
FFAI vs vLLM on GB10 - a Hugging Face Space by build-small-hackathon
Source: https://huggingface.co/spaces/build-small-hackathon/ffai-vs-vllm-gb10
Similar Articles
@no_stp_on_snek: https://x.com/no_stp_on_snek/status/2052833502475833384
An open-source stack using Qwen2.5-32B-Instruct with longctx and vllm-turboquant on a single AMD MI300X achieves competitive results (0.601-0.688) versus SubQ's closed model (0.659) on the MRCR v2 1M-context benchmark, demonstrating open-weights approaches are within striking distance.
@charles_irl: Somehow missed this one in the hustle and bustle. Very cool demo!
A developer built a 12M parameter LLM using a custom ML framework with a Rust backend and CUDA kernels, including Flash Attention and AdamW, and trained it from scratch.
@binsquares: omg, GPU acceleration on smolvm works way better than I thought. can run llama.cpp inside the smol machine with close t…
User @binsquares reports that GPU acceleration on smolvm achieves nearly 90% of host performance when running llama.cpp via the Vulkan backend.
I put together a Rust-native, CPU-only implementation of LFM2.5-8B-A1B
The author released a pure Rust, CPU-only inference implementation of the LFM2.5-8B-A1B model (4-bit Q4KM quantization), achieving a decode speed of approximately 37 tokens/s and memory usage around 7GB. The goal is to make LLMs runnable on cheap VPS or older machines. The implementation is open source and published as a cargo crate.
Wrote a custom C++ engine for MiniCPM-V 4.6 on Orange Pi AIPro (Ascend 310B) to bypass framework overhead
Developed a custom C++ inference engine for MiniCPM-V 4.6 on the Orange Pi AIPro (Ascend 310B NPU), achieving a 2x speedup over the stock framework by writing optimized AscendC kernels for matmul and causal-conv1d, reaching 5.90 tokens/s.