I put together a Rust-native, CPU-only implementation of LFM2.5-8B-A1B

Reddit r/LocalLLaMA 06/09/26, 01:11 PM Tools

rust cpu-only llm-inference quantized-model open-source local-llm

Summary

The author released a pure Rust, CPU-only inference implementation of the LFM2.5-8B-A1B model (4-bit Q4_K_M quantization), achieving a decode speed of approximately 37 tokens/s and memory usage of around 7GB. The goal is to make LLMs runnable on cheap VPS instances or older machines. The implementation is open source and published as a cargo crate.

This is still a work in progress, but since recording the video I have added callbacks for tool use, added more tests, and published it as a cargo crate. I'm currently working on speeding up the prefill. On my Ryzen 7950X, the decode speed (~37 tokens/s) is almost the same as llama.cpp's, but the prefill is not yet optimized and runs at about the same speed as decode. The model runs comfortably on a machine with 16GB of RAM, since its memory usage fits within ~7GB. You can share the weights between multiple Agent instances, each with its own KV cache. You can also clone an Agent instance when several agents share the same prompt, so you don't need to repeat the prefill work on that prompt.
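To make the weight-sharing and cloning idea concrete, here is a minimal sketch of how such a design can be expressed in Rust. It is not the crate's actual API: `ModelWeights`, `Agent`, `prefill`, and `weight_bytes` are hypothetical names, and the "KV cache" is a dummy vector standing in for the real thing. It only shows that cloning an agent copies its cache while the weights stay shared behind an `Arc`.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the read-only quantized weights.
struct ModelWeights {
    data: Vec<u8>,
}

/// Hypothetical agent: shares the weights, owns its history and KV cache.
#[derive(Clone)]
struct Agent {
    weights: Arc<ModelWeights>, // cloning copies the pointer, not the weights
    history: Vec<String>,       // per-agent chat history
    kv_cache: Vec<f32>,         // cloned by value, so each agent's cache is independent
}

impl Agent {
    fn new(weights: Arc<ModelWeights>) -> Self {
        Agent { weights, history: Vec::new(), kv_cache: Vec::new() }
    }

    /// Placeholder for prefill: records the prompt and appends one dummy
    /// cache entry per whitespace-separated token.
    fn prefill(&mut self, prompt: &str) {
        self.history.push(prompt.to_string());
        for tok in prompt.split_whitespace() {
            self.kv_cache.push(tok.len() as f32);
        }
    }

    /// Size of the shared weight buffer, read through the shared pointer.
    fn weight_bytes(&self) -> usize {
        self.weights.data.len()
    }
}

fn main() {
    let weights = Arc::new(ModelWeights { data: vec![0u8; 1024] });

    let mut base = Agent::new(Arc::clone(&weights));
    base.prefill("You are a helpful assistant.");

    // Clone after the prefill: the copy starts with the cache already filled.
    let mut worker = base.clone();
    worker.prefill("Summarize this CSV file.");

    println!("weight bytes seen by worker: {}", worker.weight_bytes());
    println!("handles to the weights: {}", Arc::strong_count(&weights));
    println!("base cache: {}, worker cache: {}", base.kv_cache.len(), worker.kv_cache.len());
    println!("worker history entries: {}", worker.history.len());
}
```

In this sketch the clone pays for a copy of the cache but not for a second copy of the weights, which is the trade-off the post describes for agents that start from the same prompt.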


Cached at: 06/09/26, 02:46 PM

TL;DR: I released a pure Rust, pure CPU implementation of the LFM2.5-8B-A1B model (4-bit Q4_K_M quantization), with decode speed around 37 tokens/s and memory usage of ~7GB. It is embeddable in applications, and the weights can be shared.

## Project Motivation

Local large language models are becoming increasingly powerful; what an 8-billion-parameter model can do is already impressive. But GPUs are expensive. CPU servers (VPS) cost only a few dollars per month, while cloud servers with GPUs cost much more. Many people want to run LLMs on old machines or cheap VPS instances for backend, automation, or experimentation. This project is designed for that scenario: a pure CPU inference engine, written entirely in Rust with minimal dependencies.

## Implementation Details

- **Model**: Selected a 4-bit Q4_K_M quantized version of the LFM 2.5 8B model (GGUF format).
- **Language & Hardware**: Pure Rust crate, zero external libraries, cross-platform SIMD primitives (AVX2 on Ryzen, ARM NEON on Apple M5).
- **Deployment**: After downloading the weight file, load it via an environment variable or the default path, and link directly into your Rust application.
- **API Design**: Create a model instance (which stores the weights), then create multiple `Agent` objects that share the weights. Each Agent has its own chat history and KV cache, which suits handling different tasks simultaneously.

## Performance and Optimization

The initial port ran at only 0.89 tokens/s. After several rounds of optimization (rewriting kernels, adding parallelism, enabling SIMD), performance improved dramatically:

- On a Ryzen 7950X (16 threads):
  - Prefill: ~38 tokens/s
  - Decode: ~33 tokens/s
- Compared to llama.cpp (forced CPU): llama.cpp's prefill is faster, but decode speeds are close. My implementation still has room for optimization and may catch up.

## Usage Examples

1. **Chat Interface**: Streams tokens and supports displaying "thinking tokens" for reasoning models.
2. **Simple Test**: Asked to introduce Montreal, the model gives mostly accurate results, but when asked about restaurants it once fabricated a non-existent restaurant (it has no internet access).
3. **Tool Calling**: The model is trained for tool use. I'm currently adding a callback mechanism so the model can call Rust functions. In demos handling CSV tables and JSON conversion tasks, the model performs well.

## Command Line and Configuration

Offers a Builder-pattern syntax for Agent configuration, plus command-line options. Supports limiting thinking length and disabling thinking (though the latter is unstable). File contents can be embedded using the `@` syntax.

## Test Environment and Compatibility

- Tested: Ryzen 7950X (with AVX2) and Apple M5 (M1–M5 should all work).
- Theoretically runs on any Intel CPU with AVX2, and even on a Raspberry Pi 5/4 (not tested, feedback welcome).
- Memory requirement: ~7GB, comfortable on machines with 16GB RAM.

## Current Status and Future Plans

- Code is open source: `github.com/maximcv/bb-lm`
- Published as a cargo crate.
- Added tool-calling callbacks and more tests.
- Plan to optimize prefill speed and match llama.cpp performance.
- Will improve documentation and examples.

## Limitations

- When thinking is disabled, the model may output raw `/think` text or go off-track.
- Prefill speed is currently weaker than llama.cpp's, but decode speed is close.
- No GPU acceleration, but CPU inference is already usable interactively.

**Source**: YouTube video - maximecb (https://www.youtube.com/watch?v=whni8GW3xNM)
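The tool-calling callbacks mentioned under Usage Examples are not shown in the summary, so the following is only a sketch of one common way to let a model call Rust functions: a registry that maps tool names to boxed closures. `ToolFn`, `ToolRegistry`, `register`, `call`, and the `csv_row_count` tool are all hypothetical names chosen for illustration, not the crate's API.

```rust
use std::collections::HashMap;

/// Hypothetical tool signature: takes the argument string produced by the
/// model and returns the text to feed back to it, or an error message.
type ToolFn = Box<dyn Fn(&str) -> Result<String, String>>;

/// Hypothetical registry mapping tool names to Rust callbacks.
struct ToolRegistry {
    tools: HashMap<String, ToolFn>,
}

impl ToolRegistry {
    fn new() -> Self {
        ToolRegistry { tools: HashMap::new() }
    }

    /// Register a closure under a name the model can refer to.
    fn register<F>(&mut self, name: &str, f: F)
    where
        F: Fn(&str) -> Result<String, String> + 'static,
    {
        self.tools.insert(name.to_string(), Box::new(f));
    }

    /// Dispatch a call by name; an unknown tool becomes an error that the
    /// caller can report back to the model.
    fn call(&self, name: &str, args: &str) -> Result<String, String> {
        match self.tools.get(name) {
            Some(f) => f(args),
            None => Err(format!("unknown tool: {}", name)),
        }
    }
}

fn main() {
    let mut registry = ToolRegistry::new();

    // Example tool: count the data rows in a CSV snippet (header excluded).
    registry.register("csv_row_count", |csv| {
        let lines = csv.lines().filter(|l| !l.trim().is_empty()).count();
        Ok(lines.saturating_sub(1).to_string())
    });

    let csv = "name,city\nAda,Montreal\nBob,Laval\n";
    println!("{:?}", registry.call("csv_row_count", csv));
    println!("{:?}", registry.call("missing", ""));
}
```

Returning a `Result` keeps failures as plain text that can be passed back into the conversation, rather than aborting the generation loop.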


Similar Articles

@QingQ77: Pure Rust LLM inference engine with custom CUDA kernels for each hardware × model × quantization combination, achieving higher inference speed than vLLM and TensorRT-LLM. https://github.com/Avarok-Cybersecurity/a…

@mylifcc: LiteLLM officially migrated to Rust! AI Gateway gets an epic performance upgrade: per-request overhead reduced by 150x (~0.05ms vs Python 7.5ms), throughput increased by 15x, memory usage reduced by 11x (peak only 32MB), single...

New LFM2.5 8b A1b model!!

LiquidAI/LFM2.5-8B-A1B-GGUF


@NFTCPS: 4GB VRAM running 70B large model? It actually works! AirLLM did a clever trick — layered inference, not loading the whole model into VRAM at once, but layer by layer, compute and discard, squeezing the giant into a small GPU. The best part: 100% open source, freebie warning https://github.com/0xSo…