Bitnet.cpp: Efficient Edge Inference for Ternary LLMs

Papers with Code Trending Papers

Summary

Bitnet.cpp presents a mixed-precision matrix multiplication library for efficient edge inference of ternary LLMs like BitNet b1.58, achieving up to 6.25x speedup over full-precision baselines. The system is open-sourced on GitHub.

The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has spurred interest in ternary LLMs. Despite this, research and practical applications focusing on efficient edge inference for ternary LLMs remain scarce. To bridge this gap, we introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs. Given that mixed-precision matrix multiplication (mpGEMM) constitutes the bulk of inference time in ternary LLMs, Bitnet.cpp incorporates a novel mpGEMM library to facilitate sub-2-bits-per-weight, efficient and lossless inference. The library features two core solutions: Ternary Lookup Table (TL), which addresses spatial inefficiencies of previous bit-wise methods, and Int2 with a Scale (I2_S), which ensures lossless edge inference, both enabling high-speed inference. Our experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over full-precision baselines and up to 2.32x over low-bit baselines, setting new benchmarks in the field. Additionally, we expand TL to element-wise lookup table (ELUT) for low-bit LLMs in the appendix, presenting both theoretical and empirical evidence of its considerable potential. Bitnet.cpp is publicly available at https://github.com/microsoft/BitNet/tree/paper , offering a sophisticated solution for the efficient and practical deployment of edge LLMs.
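
To make the I2_S idea concrete, here is a minimal pure-Python sketch of storing ternary weights as 2-bit codes with one shared scale. It is an illustration under assumptions, not the Bitnet.cpp implementation: the function names, the code mapping (-1, 0, +1 stored as 0, 1, 2), and the four-codes-per-byte layout are all hypothetical choices made for this example.

```python
# Illustrative sketch only; names and bit layout are assumptions,
# not the Bitnet.cpp API or its actual memory format.

def pack_i2s(weights, scale):
    """Pack ternary weights (-1, 0, +1) into bytes, four 2-bit codes per byte."""
    packed = bytearray((len(weights) + 3) // 4)
    for i, w in enumerate(weights):
        if w not in (-1, 0, 1):
            raise ValueError("weights must be ternary")
        code = w + 1  # assumed mapping: -1, 0, +1 -> 0, 1, 2
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed), scale, len(weights)


def unpack_i2s(packed, n):
    """Recover the n ternary weights exactly from the packed bytes."""
    return [((packed[i // 4] >> (2 * (i % 4))) & 0b11) - 1 for i in range(n)]


def dot_i2s(packed, scale, n, activations):
    """Integer dot product over decoded weights; the scale is applied once at the end."""
    acc = 0
    for i in range(n):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        acc += (code - 1) * activations[i]
    return acc * scale


weights = [1, -1, 0, 1, 0, -1, 1]
packed, scale, n = pack_i2s(weights, 0.5)
activations = [3, -2, 7, 1, 4, 5, -6]
result = dot_i2s(packed, scale, n, activations)
```

In this sketch the round trip through `unpack_i2s` returns the original ternary values unchanged, so nothing is lost by the packing itself. The cost is that 2 bits per weight is above the roughly 1.58 bits a ternary value needs in theory.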

Cached at: 06/25/26, 11:09 AM

Source: https://huggingface.co/papers/2502.11880

Published on Feb 17, 2025

Abstract

Bitnet.cpp enhances edge inference for ternary LLMs using a novel mixed-precision matrix multiplication library, achieving significant speed improvements over baselines.

The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has spurred interest in ternary LLMs. Despite this, research and practical applications focusing on efficient edge inference for ternary LLMs remain scarce. To bridge this gap, we introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs. Given that mixed-precision matrix multiplication (mpGEMM) constitutes the bulk of inference time in ternary LLMs, Bitnet.cpp incorporates a novel mpGEMM library to facilitate sub-2-bits-per-weight, efficient and lossless inference. The library features two core solutions: Ternary Lookup Table (TL), which addresses spatial inefficiencies of previous bit-wise methods, and Int2 with a Scale (I2_S), which ensures lossless edge inference, both enabling high-speed inference. Our experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over full-precision baselines and up to 2.32x over low-bit baselines, setting new benchmarks in the field. Additionally, we expand TL to element-wise lookup table (ELUT) for low-bit LLMs in the appendix, presenting both theoretical and empirical evidence of its considerable potential. Bitnet.cpp is publicly available at https://github.com/microsoft/BitNet/tree/paper , offering a sophisticated solution for the efficient and practical deployment of edge LLMs.
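
The element-wise lookup idea behind TL can be sketched in a few lines of pure Python. This is a hedged illustration, not the paper's kernel: the group size of three, the pattern ordering, and every function name below are assumptions introduced for the example. The idea shown is that a group of ternary weights is replaced by one index, and the dot product becomes a table lookup into partial sums precomputed from the activations.

```python
# Illustrative sketch only; group size, pattern order, and names are assumptions,
# not the layout used by Bitnet.cpp.
from itertools import product

G = 3  # assumed number of ternary weights per group
PATTERNS = list(product((-1, 0, 1), repeat=G))  # all 3**G = 27 weight patterns
INDEX = {pattern: i for i, pattern in enumerate(PATTERNS)}


def encode(weights):
    """Replace each group of G ternary weights with a single table index."""
    if len(weights) % G != 0:
        raise ValueError("weight count must be a multiple of the group size")
    return [INDEX[tuple(weights[i:i + G])] for i in range(0, len(weights), G)]


def build_tables(activations):
    """For each activation group, precompute its dot product with every pattern."""
    tables = []
    for i in range(0, len(activations), G):
        group = activations[i:i + G]
        tables.append([sum(w * a for w, a in zip(p, group)) for p in PATTERNS])
    return tables


def lut_dot(indices, tables):
    """Dot product computed purely by lookups and additions."""
    return sum(table[idx] for idx, table in zip(indices, tables))


weights = [1, -1, 0, 0, 1, 1]
activations = [3, -2, 7, 1, 4, 5]
indices = encode(weights)
tables = build_tables(activations)
result = lut_dot(indices, tables)
direct = sum(w * a for w, a in zip(weights, activations))
```

With this grouping, 27 patterns fit in a 5-bit index, which works out to 5/3 ≈ 1.67 bits per weight and is one way to get below 2 bits per weight. The tables depend only on the activations, so in a matrix multiplication they can be built once and reused for every weight row.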

Similar Articles

BitNet Text Embeddings

arXiv cs.CL

This paper introduces BitEmbed, an extreme low-bit framework for LLM-based text embeddings that converts pretrained LLM backbones into BitNet-style encoders with ternary weights and quantized activations. It achieves comparable performance to full-precision models while significantly reducing encoding and storage costs.

Was BitNet a dead end? What happened to ternary LLMs?

Reddit r/LocalLLaMA

The article questions why ternary language models like BitNet have not scaled beyond 2B parameters, given their initial promise, and discusses the apparent lack of progress from open-weight AI labs.

CAT-Q: Cost-efficient and Accurate Ternary Quantization for LLMs

arXiv cs.CL

CAT-Q introduces a post-training ternary quantization method for LLMs that uses learnable modulation and softened ternarization, achieving superior performance over BitNet 1.58-bit while using only 512 calibration samples and scaling to 235B parameters.

NEW BITNET MODELS!

Reddit r/LocalLLaMA

New BitCPM4-CANN models (1B, 3B, 8B) from OpenBMB have been released on Hugging Face; the poster is awaiting llama.cpp support before testing them.