cpu-inference

#cpu-inference

I benchmarked PrismML's 1-bit Bonsai-8B against IBM's Granite on CPU tool calling. The 1-bit model won, but only with grammar-constrained decoding

Reddit r/LocalLLaMA · 5h ago

An independent benchmark compares PrismML's 1-bit Bonsai-8B with IBM's Granite and other models on CPU tool calling. With grammar-constrained decoding, Bonsai-8B reaches a 92% pass rate and outperforms larger models, but it fails without the constraints. Granite is the best unconstrained ("raw") model at 72%.

Little Brains, Big Feats: Exploring Compact Language Models

Hugging Face Daily Papers · 3d ago

This paper benchmarks 17 compact language models (1B-8B parameters) as generators in Russian-language RAG systems under CPU-only inference, finding that Qwen-family models offer strong quality-latency tradeoffs for private, GPU-free deployment.

@Oluwaphilemon1: Claude Fable 5 is dead and GPT-5.6 delaying launch… Microsoft has changed the game They've open-sourced bitnet.cpp, a 1…

X AI KOLs Timeline · 2026-06-22

Microsoft open-sourced bitnet.cpp, a 1-bit LLM inference framework that can run 100B-parameter models on local CPUs without a GPU, with reported speedups of up to 6.17x and energy reductions of up to 82.2%.

I forked ik_llama.cpp and added a "--numa mirror" mode to maximize performance on multi-socket CPU systems. Just sharing and looking for testers!

Reddit r/LocalLLaMA · 2026-06-21

A developer forked ik_llama.cpp and added a "--numa mirror" mode that duplicates the model weights and KV cache on each NUMA node to maximize inference performance on multi-socket CPU systems. The post shares benchmarks and asks for testers.

Cheapest way to run GLM 5.x locally that's not a unified memory system?

Reddit r/LocalLLaMA · 2026-06-17

A discussion on the cheapest local hardware setups for running GLM 5.x and similarly sized models at 4-bit quantization, including CPU-only and multi-GPU options, with a user sharing their experience running Minimax 2.7 and Qwen 3.6 on a 5900X + 128GB DDR4 + 7900XT setup.

PSA: Test your "threads" argument in llama.cpp (+80% performance in my case)

Reddit r/LocalLLaMA · 2026-06-12

A user benchmarks thread counts for hybrid CPU-GPU inference with Gemma 4 in llama.cpp, finds an 80% performance uplift from using 16 threads instead of 6 on a hybrid-core CPU, and shares the best-performing command configuration.

You don't need a GPU to run gemma-4-26B-A4B

Reddit r/LocalLLaMA · 2026-06-07

The author shows that the Gemma-4-26B-A4B model runs on a CPU-only system with KoboldCpp at 7 tokens per second on an old desktop, suggesting that a powerful GPU may not be necessary for local LLM inference.

Benchmark: ONNX Runtime vs HF Transformers vs GGUF for Parakeet TDT 0.6B on CPU-only hardware [D]

Reddit r/MachineLearning · 2026-06-05

A benchmark of ONNX Runtime, HF Transformers, and GGUF for the Parakeet TDT 0.6B ASR model on CPU-only hardware shows that ONNX Runtime is 37% faster at inference than HF Transformers in bfloat16, while GGUF prioritizes memory efficiency.

A 10 year old Xeon is all you need

Hacker News Top · 2026-06-01

A blog post detailing how to run the Gemma 4 model on a 10-year-old Xeon server using only its CPU and DDR3 RAM, with customized llama.cpp optimizations.

@tunguz: Here is one big reason why this matters. Time spent on non-LLM inference tasks is only going to increase. However, tool…

X AI KOLs Following · 2026-05-23

A post highlights that 42% of the time in modern agentic coding is spent on CPU-based tool use rather than LLM inference. The post calls this inefficient and sees a major opportunity in redesigning these tools for AI agents.

@cocktailpeanut: Run Stable Audio 3 on ANY Computer with NO VRAM 1-click launcher for the official Stable Audio 3 gradio app. 1. Cross P…

X AI KOLs Following · 2026-05-21

A 1-click launcher for Stable Audio 3 allows running the model on any computer without a GPU, including CPU-only systems, and is cross-platform (Mac, Linux, Windows).

Local LLM CPU users... How long is it taking you to do anything?

Reddit r/openclaw · 2026-05-20

A discussion about the performance of running large language models locally on CPU, especially with large context sizes, and the challenges of VRAM constraints.

@FeitengLi: A 99M parameter TTS runs on CPU, faster than a 2B model on A100. Supertone's newly open-sourced supertonic-3 with ONNX Runtime, fully local, can run in browser, on phone, and even on Raspberry Pi.

X AI KOLs Timeline · 2026-05-15

Supertone released Supertonic 3, an open-source TTS model with 99M parameters that runs faster on a CPU than a 2B model does on an A100. It supports 31 languages and uses ONNX Runtime for fully local inference.

ggml-org/llama.cpp

GitHub Trending (daily) · 2026-05-18

llama.cpp is an open-source C/C++ library for efficient LLM inference on local hardware, supporting various quantization methods and multiple backends (CPU, GPU, etc.).
