Bitnet.cpp: Efficient Edge Inference for Ternary LLMs
Summary
Bitnet.cpp presents a mixed-precision matrix multiplication library for efficient edge inference of ternary LLMs like BitNet b1.58, achieving up to 6.25x speedup over full-precision baselines. The system is open-sourced on GitHub.
Source: https://huggingface.co/papers/2502.11880 Published on Feb 17, 2025
Abstract
Bitnet.cpp enhances edge inference for ternary LLMs using a novel mixed-precision matrix multiplication library, achieving significant speed improvements over baselines.
The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has spurred interest in ternary LLMs. Despite this, research and practical applications focusing on efficient edge inference for ternary LLMs remain scarce. To bridge this gap, we introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs. Given that mixed-precision matrix multiplication (mpGEMM) constitutes the bulk of inference time in ternary LLMs, Bitnet.cpp incorporates a novel mpGEMM library to facilitate sub-2-bits-per-weight, efficient and lossless inference. The library features two core solutions: Ternary Lookup Table (TL), which addresses spatial inefficiencies of previous bit-wise methods, and Int2 with a Scale (I2_S), which ensures lossless edge inference, both enabling high-speed inference. Our experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over full-precision baselines and up to 2.32x over low-bit baselines, setting new benchmarks in the field. Additionally, we expand TL to element-wise lookup table (ELUT) for low-bit LLMs in the appendix, presenting both theoretical and empirical evidence of its considerable potential. Bitnet.cpp is publicly available at https://github.com/microsoft/BitNet/tree/paper, offering a sophisticated solution for the efficient and practical deployment of edge LLMs.
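The two ideas named in the abstract can be illustrated with a small sketch. This is not the Bitnet.cpp kernel code: the function names, the byte layout, the group size of three, and the sample data are all assumptions chosen for readability. The first half stores each ternary weight as a 2-bit code with one shared scale, in the spirit of I2_S. The second half stores each group of three ternary weights as a single index into a table of precomputed partial sums, in the spirit of a ternary lookup table; since 3^3 = 27 patterns fit in 5 bits, that is one way to get below 2 bits per weight.

```python
from itertools import product

# All names below are hypothetical; none are taken from the Bitnet.cpp codebase.

def pack_i2(ternary):
    """Pack ternary weights (-1, 0, 1) into bytes, four 2-bit codes per byte."""
    assert len(ternary) % 4 == 0
    packed = bytearray()
    for i in range(0, len(ternary), 4):
        byte = 0
        for j, w in enumerate(ternary[i:i + 4]):
            byte |= (w + 1) << (2 * j)  # map -1/0/1 to codes 0/1/2
        packed.append(byte)
    return bytes(packed)

def unpack_i2(packed):
    """Invert pack_i2: recover the ternary weights exactly."""
    return [((byte >> (2 * j)) & 0b11) - 1 for byte in packed for j in range(4)]

def dot_i2s(packed, scale, activations):
    """Dot product from 2-bit weights plus one shared scale."""
    weights = unpack_i2(packed)
    return scale * sum(w * a for w, a in zip(weights, activations))

GROUP = 3  # assumed group size: 3 ternary weights -> 27 patterns -> 5 bits
PATTERNS = list(product((-1, 0, 1), repeat=GROUP))

def encode_tl(ternary):
    """Store each group of three weights as one index in 0..26."""
    assert len(ternary) % GROUP == 0
    return [PATTERNS.index(tuple(ternary[i:i + GROUP]))
            for i in range(0, len(ternary), GROUP)]

def dot_tl(indices, scale, activations):
    """Dot product via table lookups instead of per-weight multiplies."""
    total = 0
    for g, idx in enumerate(indices):
        a = activations[g * GROUP:(g + 1) * GROUP]
        # All 27 partial sums for this activation group. In a matrix
        # multiply this table would be built once per activation group
        # and reused for every weight row; here it is rebuilt for clarity.
        table = [sum(w * x for w, x in zip(p, a)) for p in PATTERNS]
        total += table[idx]
    return scale * total

# Illustrative data only (integer activations keep the arithmetic exact).
weights = [1, 0, -1, 1, -1, -1, 0, 0, 1, 1, 0, -1]
activations = [3, -2, 5, 7, 1, -4, 2, 6, -1, 0, 8, -3]
scale = 0.5
reference = scale * sum(w * a for w, a in zip(weights, activations))
```

Both `dot_i2s(pack_i2(weights), scale, activations)` and `dot_tl(encode_tl(weights), scale, activations)` reproduce `reference` exactly, which is the sense in which such encodings are lossless for ternary weights. In this sketch the 2-bit packing uses 3 bytes for 12 weights, while the grouped indices would need 4 x 5 = 20 bits if stored densely.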
Similar Articles
BitNet Text Embeddings
This paper introduces BitEmbed, an extreme low-bit framework for LLM-based text embeddings that converts pretrained LLM backbones into BitNet-style encoders with ternary weights and quantized activations. It achieves comparable performance to full-precision models while significantly reducing encoding and storage costs.
Was BitNet a dead end? What happened to ternary LLMs?
The article questions why ternary language models like BitNet have not scaled beyond 2B parameters, given their initial promise, and discusses the apparent lack of progress from open-weight AI labs.
CAT-Q: Cost-efficient and Accurate Ternary Quantization for LLMs
CAT-Q introduces a post-training ternary quantization method for LLMs that uses learnable modulation and softened ternarization, achieving superior performance over BitNet 1.58-bit while using only 512 calibration samples and scaling to 235B parameters.
@AdinaYakup: BitCPM4-CANN Native 1.58-bit LLM training system on Ascend NPUs https://huggingface.co/collections/openbmb/bitcpm4-cann…
OpenBMB releases BitCPM4-CANN, a collection of natively trained 1.58-bit ternary quantized LLMs (0.5B to 8B) optimized for Ascend NPUs via CANN, achieving 6× memory reduction at inference and minimal training overhead.
NEW BITNET MODELS!
New BitCPM4-CANN models (1B, 3B, 8B) from OpenBMB have been released on Hugging Face; testing awaits llama.cpp support.