I benchmarked PrismML's 1-bit Bonsai-8B against IBM's Granite on CPU tool calling. The 1-bit model won, but only with grammar-constrained decoding

Reddit r/LocalLLaMA News

Summary

An independent benchmark of PrismML's 1-bit Bonsai-8B against IBM's Granite and other models on CPU tool calling finds that, with grammar-constrained decoding, Bonsai-8B reaches a 92% pass rate, the best of any model tested despite its 1.16 GB file, but scores 0% without the grammar. Granite-4.1-3B is the best unconstrained model at 72%.

Everyone keeps asking if the 1-bit models are actually usable for agents, so I ran the numbers myself. I couldn't find a single independent tool-calling eval of Bonsai-8B anywhere: not on the BFCL leaderboard, nothing on BenchLM. So as far as I can tell, this is the first one.

Setup: 30 deterministic tool-call cases (single, parallel, sequential, abstention, format), temp 0, mainline llama.cpp on CPU. Each model runs twice: once raw, and once with a GBNF grammar constraining the output to valid tool-call JSON.

Results (PASS rate, raw / with grammar):

Bonsai-8B Q1_0 (1.16 GB): 0% / 92%
Granite-4.1-3B Q4_K_M (2.0 GB): 72% / 88%
Qwen2.5-Coder-3B: 0% / 84%
Qwen2.5-Coder-7B: 68% / 84%
Qwen3-8B: 0% / 84%
BitNet-b1.58-2B: 0% / 44%

The Bonsai result surprised me. Raw, it's useless for tool calling: 0% valid output. With the grammar active, it posted the best score of anything I've tested, from a 1.16 GB file, just over half the size of the 2.0 GB Granite 3B Q4. It was perfect on the format, parallel, sequential and abstention categories.

Granite is the opposite story: the best raw score at 72%, ahead of Qwen2.5-Coder-7B (68%) with less than half the parameters. If you can't or don't want to run grammars, that's your pick.

Takeaway for me: the "1-bit models can't do agents" claim needs a footnote. They can't do agents unconstrained. Put a grammar in front, and the semantic capability is apparently there, at least on this small benchmark.

Caveats before anyone gets too excited: 30 cases, temp 0, single run, my own harness. That's a signal, not a leaderboard. Happy to share the case set; it's all in the repo.
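For anyone wanting to try something similar, here is a minimal Python sketch of the two pieces the setup relies on: a GBNF grammar that only admits a JSON array of tool calls, and a strict scorer that turns model outputs into a PASS rate. It is not the harness from the repo. The output format (`[{"name": ..., "arguments": {...}}]`, with `[]` meaning "abstain"), the names `TOOL_CALL_GBNF`, `Case`, `parse_calls`, `passes` and `pass_rate`, the example tool `get_weather`, and the per-category matching rules (parallel calls compared order-insensitively, everything else compared in order) are all assumptions for illustration.

```python
"""Hypothetical sketch of a grammar-constrained tool-call eval (not the post's harness)."""
from __future__ import annotations

import json
from collections import Counter
from dataclasses import dataclass

# Assumed output format: a JSON array of {"name": ..., "arguments": {...}} objects,
# with an empty array meaning "no tool call" (abstention). Saved to a file, this can be
# passed to llama.cpp via --grammar-file so decoding can only emit that shape.
TOOL_CALL_GBNF = r"""
root   ::= "[" ws ( call ( ws "," ws call )* )? ws "]"
call   ::= "{" ws "\"name\"" ws ":" ws string ws "," ws "\"arguments\"" ws ":" ws object ws "}"
object ::= "{" ws ( pair ( ws "," ws pair )* )? ws "}"
pair   ::= string ws ":" ws value
value  ::= object | array | string | number | "true" | "false" | "null"
array  ::= "[" ws ( value ( ws "," ws value )* )? ws "]"
string ::= "\"" ( [^"\\] | "\\" ["\\/bfnrt] )* "\""
number ::= "-"? [0-9]+ ( "." [0-9]+ )?
ws     ::= [ \t\n]*
"""


@dataclass
class Case:
    category: str          # "single", "parallel", "sequential", "abstention" or "format"
    expected: list[dict]   # expected calls in order; [] means the model should abstain


def parse_calls(output: str) -> list[dict] | None:
    """Strictly parse output as a JSON array of tool calls; return None if it isn't one."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return None  # chatty prefixes, markdown fences or truncated JSON all land here
    if not isinstance(data, list):
        return None
    for call in data:
        if not (
            isinstance(call, dict)
            and set(call) == {"name", "arguments"}
            and isinstance(call["name"], str)
            and isinstance(call["arguments"], dict)
        ):
            return None
    return data


def _canonical(call: dict) -> str:
    # Serialize with sorted keys so argument order doesn't affect comparison.
    return json.dumps(call, sort_keys=True)


def passes(case: Case, output: str) -> bool:
    calls = parse_calls(output)
    if calls is None:
        return False  # invalid output fails regardless of category
    got = [_canonical(c) for c in calls]
    want = [_canonical(c) for c in case.expected]
    if case.category == "parallel":
        # Assumption: parallel calls may be emitted in any order, so compare as multisets.
        return Counter(got) == Counter(want)
    # Assumption: every other category (abstention included, where want == [])
    # must match exactly and in order.
    return got == want


def pass_rate(cases: list[Case], outputs: list[str]) -> float:
    """Percentage of cases whose output passes."""
    if len(cases) != len(outputs):
        raise ValueError("need exactly one output per case")
    return 100.0 * sum(passes(c, o) for c, o in zip(cases, outputs)) / len(cases)


if __name__ == "__main__":
    cases = [
        Case("single", [{"name": "get_weather", "arguments": {"city": "Paris"}}]),
        Case("abstention", []),
    ]
    # Typical unconstrained failures: a chatty prefix plus markdown fence, and a prose refusal.
    raw = [
        'Sure! ```json\n[{"name": "get_weather", "arguments": {"city": "Paris"}}]\n```',
        "I can't help with that.",
    ]
    constrained = ['[{"name": "get_weather", "arguments": {"city": "Paris"}}]', "[]"]
    print(pass_rate(cases, raw))          # 0.0
    print(pass_rate(cases, constrained))  # 100.0
```

The scorer is deliberately strict: any leading chatter or markdown fence makes `json.loads` fail, which is one plausible way an unconstrained model ends up at 0% even when the call it meant to make was right. The grammar removes exactly that failure mode, so what remains is the semantic part (right tool, right arguments, or correctly abstaining with `[]`).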