I put together a Rust-native, CPU-only implementation of LFM2.5-8B-A1B

Reddit r/LocalLLaMA Tools

Summary

The author released a pure Rust, CPU-only inference implementation of the LFM2.5-8B-A1B model (4-bit Q4_K_M quantization), with a decode speed of approximately 37 tokens/s and memory usage of around 7GB. The goal is to make LLMs runnable on a cheap VPS or an older machine. The implementation is open source and published as a cargo crate.

This is still a work in progress, but since recording the video I have added callbacks for tool use, written more tests, and published it as a cargo crate. I am currently working on speeding up the prefill. On my Ryzen 7950X, the decode speed (~37 tokens/s) is almost the same as llama.cpp's, but the prefill is not yet optimized and runs at almost the same speed as decode. The model runs comfortably on a machine with 16GB of RAM, since its memory usage fits within ~7GB. You can reuse the weights between multiple Agent instances, each with its own KV cache. If your agents share the same prompt, you can also clone an Agent object so that the prefill work on that prompt is not repeated.
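
The weight sharing and Agent cloning described above can be pictured with a small sketch. This is not the crate's actual API: `Model`, `Agent`, `prefill`, and `kv_cache` are hypothetical names, and a plain vector of token IDs stands in for a real KV cache. It only illustrates the ownership pattern, assuming the weights are read-only and held behind a reference-counted pointer.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the loaded, read-only quantized weights.
struct Model {
    weights: Vec<u8>,
}

/// Hypothetical agent: shares the weights, owns its own cache.
#[derive(Clone)]
struct Agent {
    model: Arc<Model>,  // cloning an Arc copies a pointer, not the weights
    kv_cache: Vec<u32>, // toy stand-in: one entry per prefilled token
}

impl Agent {
    fn new(model: Arc<Model>) -> Self {
        Agent { model, kv_cache: Vec::new() }
    }

    /// Toy "prefill": record the prompt tokens in this agent's cache.
    fn prefill(&mut self, tokens: &[u32]) {
        self.kv_cache.extend_from_slice(tokens);
    }

    /// Size of the shared weights, to show they are reachable from any agent.
    fn weight_bytes(&self) -> usize {
        self.model.weights.len()
    }
}
```

Under these assumptions, calling `clone()` on an agent after `prefill` gives a second agent that starts with the same cached prompt and then diverges independently, while both still point at a single copy of the weights.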

Cached at: 06/09/26, 02:46 PM

TL;DR: I released a pure Rust, pure CPU implementation of the LFM2.5-8B-A1B model (4-bit Q4_K_M quantization), with a decode speed of around 37 tokens/s and memory usage of ~7GB. It can be embedded in applications, and its weights can be shared between agents.

## Project Motivation

Local large language models are becoming increasingly powerful; what an 8-billion-parameter model can do is already impressive. But GPUs are expensive. CPU servers (VPS) cost only a few dollars per month, while cloud servers with GPUs cost much more. Many people want to run LLMs on old machines or a cheap VPS for backend work, automation, or experimentation. This project is designed for that scenario: a pure CPU inference engine, written entirely in Rust with minimal dependencies.

## Implementation Details

- **Model**: A 4-bit Q4_K_M quantized version of the LFM 2.5 8B model (GGUF format).
- **Language & Hardware**: Pure Rust crate, zero external libraries, cross-platform SIMD primitives (AVX2 on Ryzen, ARM NEON on Apple M5).
- **Deployment**: After downloading the weight file, load it via an environment variable or the default path, and link the crate directly into your Rust application.
- **API Design**: Create a model instance (which stores the weights), then create multiple `Agent` objects that share those weights. Each Agent has its own chat history and KV cache, which makes it suitable for handling different tasks simultaneously.

## Performance and Optimization

The initial port ran at only 0.89 tokens/s. After several rounds of optimization (rewriting kernels, adding parallelism, enabling SIMD), performance improved dramatically:

- On a Ryzen 7950X (16 threads):
  - Prefill: ~38 tokens/s
  - Decode: ~33 tokens/s
- Compared to llama.cpp (forced to CPU): llama.cpp's prefill is faster, but decode speeds are close. My implementation still has room for optimization and may catch up.

## Usage Examples

1. **Chat Interface**: Streams tokens and supports displaying “thinking tokens” for reasoning models.
2. **Simple Test**: Asked to introduce Montreal, the model gives mostly accurate results, but when asked about restaurants it once fabricated a non-existent restaurant (it has no internet access).
3. **Tool Calling**: The model is trained for tool use. I am currently adding a callback mechanism so the model can call Rust functions. In demos involving CSV tables and JSON conversion tasks, the model performs well.

## Command Line and Configuration

Agent configuration uses a builder-pattern syntax, and command-line options are also available. Thinking length can be limited, and thinking can be disabled (though this is unstable). File contents can be embedded using the `@` syntax.

## Test Environment and Compatibility

- Tested: Ryzen 7950X (with AVX2) and Apple M5 (M1–M5 should all work).
- Theoretically runs on any Intel or AMD CPU with AVX2, and possibly even on a Raspberry Pi 5/4 through ARM NEON (not tested, feedback welcome).
- Memory requirement: ~7GB, comfortable on machines with 16GB of RAM.

## Current Status and Future Plans

- Code is open source: `github.com/maximcv/bb-lm`
- Published as a cargo crate.
- Added tool-calling callbacks and more tests.
- Plan to optimize prefill speed and match llama.cpp performance.
- Will improve documentation and examples.

## Limitations

- When thinking is disabled, the model may output raw `/think` text or go off-track.
- Prefill speed is currently weaker than llama.cpp's, but decode speed is close.
- No GPU acceleration, but CPU inference is already usable interactively.

**Source**: YouTube video - maximecb (https://www.youtube.com/watch?v=whni8GW3xNM)
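
The tool-calling callbacks mentioned under Usage Examples could be pictured roughly as follows. This is a minimal sketch, not the crate's real interface: `ToolRegistry`, `register`, `call`, and the `csv_row_count` tool are all hypothetical, and the sketch assumes that a tool receives the model-supplied arguments as a single string and returns either a result string or an error message that could be fed back to the model.

```rust
use std::collections::HashMap;

/// Hypothetical tool signature: argument text in, result or error text out.
type ToolFn = Box<dyn Fn(&str) -> Result<String, String>>;

/// Hypothetical registry mapping tool names to Rust callbacks.
struct ToolRegistry {
    tools: HashMap<String, ToolFn>,
}

impl ToolRegistry {
    fn new() -> Self {
        ToolRegistry { tools: HashMap::new() }
    }

    /// Register a closure or function under a name the model can refer to.
    fn register<F>(&mut self, name: &str, f: F)
    where
        F: Fn(&str) -> Result<String, String> + 'static,
    {
        self.tools.insert(name.to_string(), Box::new(f));
    }

    /// Dispatch a call by name; unknown names become an error string.
    fn call(&self, name: &str, args: &str) -> Result<String, String> {
        match self.tools.get(name) {
            Some(f) => f(args),
            None => Err(format!("unknown tool: {}", name)),
        }
    }
}

/// Example tool: count the data rows of a CSV string, excluding the header.
fn csv_row_count(csv: &str) -> Result<String, String> {
    let rows = csv.lines().filter(|l| !l.trim().is_empty()).count();
    if rows == 0 {
        return Err("empty CSV".to_string());
    }
    Ok((rows - 1).to_string())
}
```

In this sketch, a host application would register its tools once (for example `reg.register("csv_row_count", csv_row_count)`) and call `reg.call(name, args)` whenever the model's output is parsed as a tool request. Returning errors as plain strings keeps the failure path in the same text channel the model already reads.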
