I benchmarked PrismML's 1-bit Bonsai-8B against IBM's Granite on CPU tool calling. The 1-bit model won, but only with grammar-constrained decoding

Reddit r/LocalLLaMA 07/02/26, 01:32 PM News

1-bit-models tool-calling benchmark bonsai-8b granite llama-cpp grammar-constrained-decoding cpu-inference

Summary

An independent benchmark of PrismML's 1-bit Bonsai-8B against IBM's Granite and four other models on CPU tool calling shows that with grammar-constrained decoding, Bonsai-8B achieves a 92% pass rate, the highest of the six models tested, from a 1.16 GB file. Without a grammar it produces 0% valid output. Granite-4.1-3B is the best unconstrained model at 72%.

Everyone keeps asking if the 1-bit models are actually usable for agents, so I ran the numbers myself. Couldn't find a single independent tool-calling eval of Bonsai-8B anywhere. Not on the BFCL leaderboard, nothing on BenchLM. So as far as I can tell this is the first one.

Setup: 30 deterministic tool-call cases (single, parallel, sequential, abstention, format), temp 0, mainline llama.cpp on CPU. Each model runs twice: once raw, once with a GBNF grammar constraining the output to valid tool-call JSON.

Results (PASS rate, raw / with grammar):

Bonsai-8B Q1_0 (1.16 GB): 0% / 92%
Granite-4.1-3B Q4_K_M (2.0 GB): 72% / 88%
Qwen2.5-Coder-3B: 0% / 84%
Qwen2.5-Coder-7B: 68% / 84%
Qwen3-8B: 0% / 84%
BitNet-b1.58-2B: 0% / 44%

The Bonsai result surprised me. Raw, it's useless for tool calling. 0% valid output. With the grammar active it posted the best score of anything I've tested, from a file roughly half the size of a 3B Q4. Perfect on format, parallel, sequential and abstention categories.

Granite is the opposite story. Best raw model by far at 72%. If you can't or don't want to run grammars, that's your pick.

Takeaway for me: the "1-bit models can't do agents" claim needs a footnote. They can't do agents unconstrained. Put a grammar in front and the semantic capability is apparently there, at least on this small benchmark.

Caveats before anyone gets too excited: 30 cases, temp 0, single run, my own harness. That's a signal, not a leaderboard. Happy to share the case set, it's all in the repo.
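
For anyone wanting to picture the raw vs. grammar split, here is a minimal sketch of what such a setup could look like. It is not the harness from the repo. The tool-call schema (a JSON array of objects with "name" and "arguments"), the grammar text, and the function names parse_calls, case_passes and pass_rate are all assumptions made for illustration. The grammar is written in llama.cpp's GBNF syntax and is deliberately simplified (no nested arrays, no string escapes).

```python
import json

# Hypothetical GBNF grammar: admits only a JSON array of
# {"name": "...", "arguments": {...}} objects. An empty array stands for
# abstention. Simplified on purpose; NOT the grammar used in the benchmark.
TOOL_CALL_GBNF = r'''
root   ::= "[" ws ( call ( ws "," ws call )* )? ws "]"
call   ::= "{" ws "\"name\"" ws ":" ws string ws "," ws "\"arguments\"" ws ":" ws object ws "}"
object ::= "{" ws ( pair ( ws "," ws pair )* )? ws "}"
pair   ::= string ws ":" ws value
value  ::= string | number | object | "true" | "false" | "null"
string ::= "\"" [^"\\]* "\""
number ::= "-"? [0-9]+ ( "." [0-9]+ )?
ws     ::= [ \t\n]*
'''


def parse_calls(text):
    """Return the list of tool calls, or None if the output is not
    valid tool-call JSON under the assumed schema."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list):
        return None
    for call in data:
        if not (
            isinstance(call, dict)
            and isinstance(call.get("name"), str)
            and isinstance(call.get("arguments"), dict)
        ):
            return None
    return data


def case_passes(output, expected, ordered=True):
    """A case passes when the parsed calls match the expected calls.
    ordered=False ignores call order, which suits parallel-call cases."""
    calls = parse_calls(output)
    if calls is None:
        return False
    got = [(c["name"], c["arguments"]) for c in calls]
    want = [(c["name"], c["arguments"]) for c in expected]
    if ordered:
        return got == want
    canonical = lambda item: json.dumps(item, sort_keys=True)
    return sorted(map(canonical, got)) == sorted(map(canonical, want))


def pass_rate(results):
    """Percentage of passing cases in a list of booleans."""
    return 100.0 * sum(results) / len(results)
```

Under this sketch, a raw model that wraps its call in chat prose fails at parse_calls no matter how sensible the intended call was, while the grammar only guarantees the shape of the output. Picking the right function and arguments is still up to the model, which is why the with-grammar scores differ between models.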


