Tag
An independent benchmark of PrismML's 1-bit Bonsai-8B against IBM's Granite and other models on CPU tool calling shows that with grammar-constrained decoding, Bonsai-8B achieves a 92% pass rate, outperforming larger models, but fails without constraints. Granite is the best raw model at 72%.
This paper benchmarks 17 compact language models (1B-8B parameters) as generators in Russian-language RAG systems under CPU-only inference, finding that Qwen-family models offer strong quality-latency tradeoffs for private, GPU-free deployment.
Microsoft open-sourced bitnet.cpp, a 1-bit LLM inference framework that enables running 100B-parameter models on local CPUs without GPUs, achieving up to 6.17x faster inference and up to 82.2% less energy consumption.
A developer forked ik_llama.cpp and added a '--numa mirror' mode that duplicates model weights and KV cache across NUMA nodes to maximize multi-socket CPU inference performance, sharing benchmarks and seeking testers.
A discussion on the cheapest local hardware setups for running GLM 5.x and similarly sized models at 4-bit quantization, including CPU-only and multi-GPU options, with a user sharing their experience running Minimax 2.7 and Qwen 3.6 on a 5900X + 128GB DDR4 + 7900XT setup.
A user benchmarks thread count for hybrid CPU-GPU inference with Gemma 4 in llama.cpp, discovering an 80% performance uplift by using 16 threads instead of 6 on a hybrid-core CPU, and shares the optimal command configuration.
The author demonstrates that the Gemma-4-26B-A4B model runs efficiently on a CPU-only system using Koboldcpp, achieving 7 tokens per second on an old desktop, suggesting that powerful GPUs may not be necessary for local LLM inference.
A benchmark comparing ONNX Runtime, HF Transformers, and GGUF for the Parakeet TDT 0.6B ASR model on CPU-only hardware shows that ONNX Runtime achieves 37% faster inference than HF Transformers in bfloat16, while GGUF prioritizes memory efficiency.
A blog post detailing how to run the Gemma 4 AI model on a 10-year-old Xeon server with only CPU and DDR3 RAM, using customized llama.cpp optimizations.
A post highlights that 42% of time in modern agentic coding is spent on CPU-based tool use, which is inefficient and presents a major opportunity to redesign these tools for AI agents.
A 1-click launcher for Stable Audio 3 allows running the model on any computer without a GPU, including CPU-only systems, and is cross-platform (Mac, Linux, Windows).
A discussion about the performance of running large language models locally on CPU, especially with large context sizes, and the challenges of VRAM constraints.
Supertone released Supertonic 3, an open-source TTS model with 99M parameters that runs faster on CPU than a 2B model on A100, supporting 31 languages and ONNX Runtime for fully local inference.
llama.cpp is an open-source C/C++ library for efficient LLM inference on local hardware, supporting various quantization methods and multiple backends (CPU, GPU, etc.).