I put together a Rust-native, CPU-only implementation of LFM2.5-8B-A1B
Summary
The author released a pure-Rust, CPU-only inference implementation of the LFM2.5-8B-A1B model using 4-bit Q4_K_M quantization. It decodes at approximately 37 tokens/s and uses around 7 GB of memory. The goal is to make LLMs runnable on cheap VPS instances or older machines. The implementation is open source and published as a Cargo crate.
Cached at: 06/09/26, 02:46 PM
Similar Articles
@QingQ77: Pure Rust LLM inference engine with custom CUDA kernels for each hardware × model × quantization combination, achieving higher inference speed than vLLM and TensorRT-LLM. https://github.com/Avarok-Cybersecurity/a…
Atlas is a pure Rust LLM inference engine that delivers faster inference than vLLM and TensorRT-LLM by customizing CUDA kernels for each hardware × model × quantization combination.
@NFTCPS: Running a 70B model on 4GB of VRAM? It actually works! AirLLM uses a clever trick, layered inference: instead of loading the whole model into VRAM at once, it loads one layer at a time, computes, and discards it, squeezing the giant into a small GPU. Best of all, it is 100% open source and free. https://github.com/0xSo…
AirLLM is a fully open-source tool that uses layered inference (loading and releasing VRAM layer by layer) to enable 70B large language models to run on GPUs with only 4GB VRAM, without quantization, distillation, or pruning. It already supports running Llama3.1 405B on 8GB VRAM.
New LFM2.5 8b A1b model!!
Introducing LFM2.5 8b A1b, a new AI model that performs on par with Nemotron 3 Nano while running faster. SmallCode is adding support for non-standard tool calls.
LiquidAI/LFM2.5-8B-A1B-GGUF
LiquidAI releases a GGUF quantized version of their LFM2.5-8B-A1B model, with instructions for use across multiple inference engines.
@NFTCPS: Attention, everyone running large models locally! Someone has turned llama.cpp into a performance beast: BeeLlama.cpp. With the same VRAM, inference speed triples and context capacity expands 7.5x. This isn't a slide deck; it's real benchmark data. It packs three top-tier optimizations into one codebase: DFlash speculative decoding…
BeeLlama.cpp is a fork of llama.cpp that integrates DFlash speculative decoding, TurboQuant/TCQ KV-cache compression, and adaptive draft control, achieving up to 3x faster inference and 7.5x context expansion on the same hardware.