@QingQ77: Pure Rust LLM inference engine with custom CUDA kernels for each hardware × model × quantization combination, achieving higher inference speed than vLLM and TensorRT-LLM. https://github.com/Avarok-Cybersecurity/a…


Summary

Atlas is a pure Rust LLM inference engine that delivers faster inference than vLLM and TensorRT-LLM by customizing CUDA kernels for each hardware × model × quantization combination.
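One way to picture kernel selection per hardware × model × quantization combination is a lookup keyed on that triple. The Rust sketch below is illustrative only: the enum variants, the `KernelRegistry` type, the kernel names, and the generic fallback are assumptions made for this example, not Atlas's actual API.

```rust
use std::collections::HashMap;

// Hypothetical identifiers for illustration; not taken from the Atlas codebase.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Hardware {
    Gb10,
    Other,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum ModelFamily {
    Qwen35,
    Gemma4,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Quant {
    Bf16,
    Int4,
}

/// One key per hardware × model × quantization combination.
pub type KernelKey = (Hardware, ModelFamily, Quant);

/// Maps each tuned combination to the name of its kernel.
#[derive(Default)]
pub struct KernelRegistry {
    tuned: HashMap<KernelKey, &'static str>,
}

impl KernelRegistry {
    pub fn new() -> Self {
        Self::default()
    }

    /// Registers a hand-tuned kernel for one exact combination.
    pub fn register(&mut self, key: KernelKey, kernel_name: &'static str) {
        self.tuned.insert(key, kernel_name);
    }

    /// Returns the tuned kernel if one exists, otherwise a generic fallback
    /// (the fallback is an assumption of this sketch).
    pub fn select(&self, key: KernelKey) -> &'static str {
        self.tuned.get(&key).copied().unwrap_or("generic_fallback")
    }
}
```

Registering `(Hardware::Gb10, ModelFamily::Qwen35, Quant::Bf16)` and then calling `select` with the same triple returns the registered name; any unregistered triple falls through to `"generic_fallback"`.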

https://github.com/Avarok-Cybersecurity/atlas…

Atlas is a pure Rust LLM inference engine whose architecture uses traits to split model loading, layer computation, GPU backend, communication, and storage into pluggable modules. It targets NVIDIA GB10 first, with 12 hand-tuned kernel targets, and runs the Qwen3.5/3.6/3-Next/3-VL, Gemma-4, Mistral-Small-4, MiniMax-M2, and Nemotron-H models. Qwen3.5-35B-A3B with MTP speculative decoding reaches 131 tok/s, faster than NVIDIA's own vLLM on the same GB10. The KV cache has six quantization levels, from BF16 down to Turbo3 (3-bit Lloyd-Max). The Turbo series applies a Walsh-Hadamard rotation to reduce error at the same bit width.
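To illustrate why a Walsh-Hadamard rotation can reduce quantization error at a fixed bit width, here is a minimal CPU-side Rust sketch. It is not Atlas's kernel code. All function names are hypothetical, and a plain uniform quantizer stands in for the Lloyd-Max codebook mentioned above to keep the example short. The idea shown is that the rotation spreads a single large outlier evenly across all coordinates, so a low-bit grid fits the rotated vector much better than the original.

```rust
/// In-place fast Walsh-Hadamard transform with orthonormal scaling.
/// With this scaling the transform is its own inverse.
/// Panics if the length is not a power of two.
pub fn fwht(x: &mut [f32]) {
    let n = x.len();
    assert!(n.is_power_of_two(), "length must be a power of two");
    let mut h = 1;
    while h < n {
        for block in (0..n).step_by(2 * h) {
            for i in block..block + h {
                let (a, b) = (x[i], x[i + h]);
                x[i] = a + b;
                x[i + h] = a - b;
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (n as f32).sqrt();
    for v in x.iter_mut() {
        *v *= scale;
    }
}

/// Uniform symmetric quantizer over [-max|x|, +max|x|] with 2^bits grid points.
/// A stand-in for Lloyd-Max, used here only for brevity.
pub fn quantize_uniform(x: &[f32], bits: u32) -> Vec<f32> {
    let max = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    if max == 0.0 {
        return x.to_vec();
    }
    let steps = ((1u32 << bits) - 1) as f32;
    let step = 2.0 * max / steps;
    x.iter()
        .map(|v| ((v + max) / step).round() * step - max)
        .collect()
}

/// Mean squared error between two equal-length slices.
pub fn mse(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(p, q)| (p - q) * (p - q)).sum::<f32>() / a.len() as f32
}

/// Quantizes `x` at `bits` bits, optionally inside a Hadamard rotation,
/// and returns the reconstruction error in the original basis.
pub fn roundtrip_error(x: &[f32], bits: u32, rotate: bool) -> f32 {
    let mut y = x.to_vec();
    if rotate {
        fwht(&mut y);
    }
    let mut q = quantize_uniform(&y, bits);
    if rotate {
        fwht(&mut q); // orthonormal Hadamard is self-inverse
    }
    mse(x, &q)
}
```

For the extreme case of a one-hot vector such as `[8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]` at 3 bits, `roundtrip_error(&x, 3, true)` is smaller than `roundtrip_error(&x, 3, false)`, because the rotated vector has equal-magnitude entries that land on the grid. Real KV-cache activations are less extreme, so the gain in practice depends on the data.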

Cached at: 05/08/26, 05:37 PM

Atlas Inference Engine

Pure Rust LLM Inference: Universal Inference at Unimaginable Speeds
