@QingQ77: Pure Rust LLM inference engine with custom CUDA kernels for each hardware × model × quantization combination, achieving higher inference speed than vLLM and TensorRT-LLM. https://github.com/Avarok-Cybersecurity/a…

X AI KOLs Timeline 05/08/26, 09:03 AM Tools

rust llm-inference open-source cuda-kernels performance inference-engine

Summary

Atlas is a pure Rust LLM inference engine that delivers faster inference than vLLM and TensorRT-LLM by customizing CUDA kernels for each hardware × model × quantization combination.

Pure Rust LLM inference engine with custom CUDA kernels for each hardware × model × quantization combination, achieving higher inference speed than vLLM and TensorRT-LLM. https://github.com/Avarok-Cybersecurity/atlas…

Atlas is a pure Rust LLM inference engine whose architecture uses traits to split model loading, layer computation, the GPU backend, communication, and storage into pluggable modules. It currently targets the NVIDIA GB10 first, with 12 hand-tuned kernel targets, and runs the Qwen3.5/3.6/3-Next/3-VL, Gemma-4, Mistral-Small-4, MiniMax-M2, and Nemotron-H models. With MTP speculative decoding, Qwen3.5-35B-A3B reaches 131 tok/s, faster than NVIDIA's own vLLM on the same GB10. The KV cache supports six quantization levels, from BF16 down to Turbo3 (3-bit Lloyd-Max). The Turbo series applies a Walsh-Hadamard rotation to reduce quantization error at the same bit width.
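To make the Walsh-Hadamard idea concrete, here is a minimal host-side Rust sketch. It is not taken from the Atlas code base: the function names (`fwht_orthonormal`, `fake_quantize`, `plain_error`, `rotated_error`) are hypothetical, and a simple uniform absmax quantizer stands in for the Lloyd-Max codebook the summary mentions. It only illustrates why rotating a vector before low-bit quantization can lower the error when one outlier dominates the range.

```rust
/// In-place fast Walsh-Hadamard transform, scaled by 1/sqrt(n) so that it is
/// orthonormal and therefore its own inverse. Length must be a power of two.
fn fwht_orthonormal(x: &mut [f32]) {
    let n = x.len();
    assert!(n.is_power_of_two(), "length must be a power of two");
    let mut h = 1;
    while h < n {
        for block in (0..n).step_by(2 * h) {
            for i in block..block + h {
                let (a, b) = (x[i], x[i + h]);
                x[i] = a + b; // butterfly: sum
                x[i + h] = a - b; // butterfly: difference
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (n as f32).sqrt();
    for v in x.iter_mut() {
        *v *= scale;
    }
}

/// Quantize to `bits` signed levels with one absmax scale, then dequantize.
/// ASSUMPTION: uniform quantizer used as a stand-in, not a Lloyd-Max codebook.
fn fake_quantize(x: &[f32], bits: u32) -> Vec<f32> {
    assert!(bits >= 2, "need at least 2 bits for a signed grid");
    let qmax = ((1i32 << (bits - 1)) - 1) as f32;
    let absmax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    if absmax == 0.0 {
        return x.to_vec();
    }
    let scale = absmax / qmax;
    x.iter()
        .map(|v| (v / scale).round().clamp(-qmax, qmax) * scale)
        .collect()
}

/// Mean squared error between two equally sized slices.
fn mse(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>() / a.len() as f32
}

/// Error when quantizing the vector directly.
fn plain_error(x: &[f32], bits: u32) -> f32 {
    mse(x, &fake_quantize(x, bits))
}

/// Error when rotating first, quantizing, then rotating back.
fn rotated_error(x: &[f32], bits: u32) -> f32 {
    let mut y = x.to_vec();
    fwht_orthonormal(&mut y);
    let mut restored = fake_quantize(&y, bits);
    fwht_orthonormal(&mut restored); // orthonormal WHT is self-inverse
    mse(x, &restored)
}
```

For the toy vector `[10.0, 1.0, -1.0, 1.0]` at 3 bits, `plain_error` rounds the three small entries to zero, while `rotated_error` spreads the outlier across all four coordinates and keeps them on the grid, so the rotated error comes out much smaller. The gain depends on the data: a vector without a dominant outlier may see little or no improvement.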

Cached at: 05/08/26, 05:37 PM

Atlas Inference Engine: Pure Rust LLM Inference. Universal Inference At Unimaginable Speeds.

Similar Articles

Every AI researcher should grasp inference acceleration—CUDA Graph is the heart of vLLM's GPU efficiency

X AI KOLs Timeline

A tweet urging AI researchers to learn inference-acceleration basics and spotlighting CUDA Graph as the key to vLLM’s GPU utilization.

@LinQingV: When exploring LLM inference chip architectures previously, I reviewed the architectures of the four major AI inference ASIC companies: Groq, SambaNova, Tenstorrent, and Cerebras. While the first three have different emphases, their underlying logic falls within the same framework: large on-chip SRAM + dataflow architecture + deterministic scheduling...

X AI KOLs Timeline

The article analyzes the AI inference ASIC architectures of Groq, SambaNova, Tenstorrent, and Cerebras, highlighting Cerebras's unique wafer-scale engine design. It discusses the benefits of deterministic latency and high bandwidth for LLM inference, while noting challenges like yield, cost, and KV cache bottlenecks.

SwiftLM: Pure-Swift Apple Silicon LLM inference server—no Python, runs big models on low-RAM Macs

X AI KOLs Timeline

SwiftLM is a Swift-native LLM inference server for Apple Silicon that runs large models without Python, using SSD streaming to load MoE weights and enabling 122B models on 64 GB Macs.

@seclink: It seems Ollama has been thoroughly bested by vLLM. Given the rapid pace of large model development (with new models released almost weekly), using vLLM is often more practical and convenient than using tools like DeepSpeed or TensorRT.

X AI KOLs Following

The article argues that vLLM has overtaken Ollama in usability due to the rapid pace of new model releases, finding it more practical than alternatives like DeepSpeed or TensorRT.

@linexjlin: K2.6 built a Zig LLM inference engine from scratch on Mac in 12h, pushing Qwen 3.5 0.8B from 15 tok/s to 193.1 tok/s

X AI KOLs Timeline

A developer wrote a Zig-based LLM inference engine from scratch on macOS in 12 hours, boosting Qwen 3.5 0.8B throughput from 15 to 193 tokens per second.
