@amitiitbhu: Gemma 4 now runs 2x faster with MTP GGUFs! Run locally on just 6GB RAM. New Article: How does GGUF work? Read here: htt…

X AI KOLs Timeline 06/12/26, 03:49 AM Models

gemma-4 gguf mtp local-inference quantization performance

Summary

Gemma 4 now runs 2x faster with MTP GGUF format and can run locally on just 6GB RAM. The linked article explains how GGUF works, including quantization and memory mapping.

Gemma 4 now runs 2x faster with MTP GGUFs! Run locally on just 6GB RAM. New Article: How does GGUF work? Read here: https://outcomeschool.com/blog/how-does-gguf-work…


Cached at: 06/12/26, 10:56 AM


How does GGUF work?

Source: https://outcomeschool.com/blog/how-does-gguf-work

In this blog, we will learn about how GGUF works. We will also see what problem it solves, what is stored inside a GGUF file, how quantization makes big models fit on a normal laptop, and where it is used in real tools.

We will cover the following:

What is a model and what are weights
What is local inference
The problem before GGUF
What is GGUF
What is stored inside a GGUF file
What is quantization
Understanding quantization names like Q4_K_M
How GGUF loads fast with memory mapping
Why GGUF is cross-platform and extensible
GGUF in the real world

I am Amit Shekhar, Founder @Outcome School. I have taught and mentored many developers, and their efforts landed them high-paying tech jobs, helped many tech companies in solving their unique problems, and created many open-source libraries being used by top companies. I am passionate about sharing knowledge through open-source, blogs, and videos.

I teach AI and Machine Learning at Outcome School.

Let’s get started.

What is a model and what are weights

Before we talk about GGUF, we must first understand a few simple things.

A large language model, or LLM, is the technology behind tools like ChatGPT and Claude. We give it some text, and it gives us back some text.

Now, the natural question is, what is this model actually made of?

A model is made of a huge collection of numbers called weights.

In simple words, the weights are the numbers the model learned during training. When the model was trained, it read a lot of text and slowly adjusted these numbers until it became good at predicting the next word. These learned numbers are the brain of the model.

Let’s say a model has 7 billion weights. That means it is a list of 7 billion numbers. When we ask the model a question, it does a lot of math using these numbers to produce an answer.

These weights are usually stored as tensors. A tensor is just a fancy word for a grid of numbers. For the sake of understanding, we can simply think of tensors as big tables full of the model’s learned numbers. Do not worry, we do not need any math here. We only need to remember that a tensor is a block of numbers.

So, a model is basically a giant pile of numbers, the weights, arranged as tensors.
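
Let’s make this concrete with a minimal sketch in plain Python. The tensor names and shapes below are made up for illustration and do not belong to any real model. The sketch only shows that a model’s weight count is the sum of the sizes of its tensors.

```python
from math import prod

# Hypothetical tensor shapes for a toy model (not a real architecture).
toy_model = {
    "embedding": (1000, 64),     # a table with 1000 rows and 64 columns
    "layer0.weight": (64, 64),
    "output": (64, 1000),
}

def count_weights(tensors):
    """Total number of individual numbers across all tensors."""
    return sum(prod(shape) for shape in tensors.values())

total_weights = count_weights(toy_model)
print(total_weights)
```

A real model works the same way, only with many more tensors and much larger shapes, which is how the total reaches billions.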

This is the foundation we needed. Now let’s understand where we want to run this model.

What is local inference

When we use the model to get an answer, that act of running the model is called inference.

In simple words, training is when the model learns, and inference is when the model is used. Inference is the model doing its job, taking our text and producing a reply.

Now, there are two places we can run inference.

The first is on a big server in the cloud, far away, owned by a company. We send our text over the internet, the server runs the model, and it sends the answer back.

The second is local inference. Local means on our own machine. So, local inference is running the model directly on our own laptop or computer, without sending anything to a far-away server.

Let’s picture both places as below:

CLOUD INFERENCE:

  our laptop  ---- our text over internet ---->  cloud server (runs model)
  our laptop  <--- answer over internet --------  cloud server

LOCAL INFERENCE:

  our laptop (runs model)
     our text  --->  model  --->  answer
     nothing leaves the machine

Here, we can see that with cloud inference our text travels over the internet to a far-away server and the answer travels back. With local inference, everything happens on our own laptop, and nothing leaves the machine.

People want local inference for good reasons. It keeps our data private, because nothing leaves our machine. It works without the internet. And it has no per-use cost from a cloud provider.

So, our goal is clear. We want to take this giant pile of weights and run it on a normal laptop. Now let’s see why this was hard before GGUF.

The problem before GGUF

To run a model on our laptop, the program needs more than just the weights. It needs a few things together:

The weights, which are the model’s learned numbers, stored as tensors.
The tokenizer, which is the tool that breaks our text into small pieces the model can read, and joins the pieces back into text. We will understand this better soon.
The config and metadata, which is the information that describes the model, for example what kind of model it is, how it is built, and how much text it can handle at once.

In the early days, these pieces were often scattered across many separate files in different formats. One file for the weights, other files for the tokenizer, another file for the settings. To load the model, the program had to find all of them, read each one correctly, and stitch them together.

This caused real problems:

**It was fragile.** If one file was missing or in a slightly different format, loading failed.
**It was not portable.** Sharing a model meant sharing a folder of many files, and another person’s tool might read them differently.
**It was slow to load.** Reading and combining many files takes time before the model is even ready.

There was an earlier format called GGML that tried to pack things together for running models on normal computers. GGML was a good start and made local inference possible. But as models grew and changed, GGML had limits. It was not flexible enough, and adding new information to the file was painful. Every time the model design changed, the format struggled to keep up.

Let’s picture the old scattered way against the single-file goal as below:

THE OLD WAY (scattered files):

  +----------------+   +------------------+   +-----------------+
  |  weights file  |   |  tokenizer file  |   |  settings file  |
  +----------------+   +------------------+   +-----------------+
          |                    |                      |
          +--------------------+----------------------+
                               |
                       program must find,
                       read, and stitch all
                       of them together

  fragile, not portable, slow to load

THE GOAL (one organized file):

  +----------------------------------------+
  |  weights + tokenizer + settings        |
  |  all packed together in one file       |
  +----------------------------------------+

  open it and run, nothing to stitch

Here, we can see that the old way kept the weights, the tokenizer, and the settings in separate files, so the program had to find each one, read it correctly, and stitch them together. The goal is one organized file that holds everything, so the program can open it and run right away.

We needed a single, well-organized file that holds everything, loads fast, and is easy to extend.

So, here comes GGUF to the rescue.

What is GGUF

GGUF is a single file format that stores everything needed to run a large language model for local inference, all in one self-contained file.

GGUF is commonly expanded as GPT-Generated Unified Format. The key idea for us is in the words Unified Format, where “Unified” means everything the model needs is brought together into one single, organized file, instead of being scattered across many separate files.

In simple words, GGUF is one file that contains the model’s weights, the tokenizer, and all the settings, packed neatly together so a program can open it and start running the model right away.

It is the successor to the older GGML format. GGUF was created to fix GGML’s limits, and today it is the standard format for running models locally.

Let’s use a simple analogy. Suppose we want to move into a new house and we need a bed, a table, and a chair. The old way was like getting these as loose parts in three different boxes, with instructions in three different languages, and we had to assemble everything ourselves. GGUF is like getting one neatly packed box that has everything inside, clearly labeled, ready to use the moment we open it.

So, GGUF is the one box that holds the whole model, ready for our laptop to run.

Now, let’s open the box and see what is inside.

What is stored inside a GGUF file

A GGUF file is built to hold three kinds of things together. Let’s look at each one.

**First, the tensors.** These are the model’s weights, the learned numbers we talked about earlier. This is the biggest part of the file, because a model can have billions of numbers.

**Second, the key-value metadata.** Metadata is data that describes other data. It is stored as simple pairs of a key and a value, like a label and its answer. For example, a key could be the architecture, which means the type of model and how it is built, and its value tells the program which kind of model this is. Another key could be the context length, which means how many tokens the model can read at once, and its value is a number like 4096. There are many such pairs that fully describe the model.

**Third, the tokenizer.** A model cannot read raw text. It first breaks the text into small pieces called tokens. A token is a small chunk of text, roughly a word or part of a word. The tokenizer is the tool that does this splitting and also joins tokens back into text. GGUF stores the tokenizer’s vocabulary and rules right inside the file, so the program does not need any extra file to understand our text.

We have a detailed blog on Byte Pair Encoding in LLMs that explains how this tokenization step works.
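
For the sake of understanding, here is a toy tokenizer in Python. It is a sketch under simple assumptions: the vocabulary is made up, and it splits on spaces, whereas real tokenizers work on sub-word pieces. It only shows the two jobs described above, turning text into token ids and turning ids back into text.

```python
# A toy word-level tokenizer. Real tokenizers work on sub-word pieces
# (for example Byte Pair Encoding); this vocabulary is made up.
vocab = ["<unk>", "hello", "world", "gguf", "is", "one", "file"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def encode(text):
    """Split text on whitespace and map each piece to its token id.

    Pieces that are not in the vocabulary map to id 0, the "<unk>" token.
    """
    return [token_to_id.get(piece, 0) for piece in text.lower().split()]

def decode(ids):
    """Map token ids back to text."""
    return " ".join(vocab[i] for i in ids)

ids = encode("GGUF is one file")
print(ids)          # [3, 4, 5, 6]
print(decode(ids))  # gguf is one file
```

Because the vocabulary travels with the model, the program that runs the model and the text it was trained on always agree on which id means which piece of text.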

Let’s picture the layout of a GGUF file as below:

A SINGLE GGUF FILE

+--------------------------------------------------+
|  HEADER                                          |
|    marker "GGUF" + version + counts              |
+--------------------------------------------------+
|  KEY-VALUE METADATA                              |
|    architecture      = "llama"                   |
|    context length    = 4096                      |
|    tokenizer vocab   = [ ...tokens... ]          |
|    quantization info = ...                       |
+--------------------------------------------------+
|  TENSOR INFO (a small table of contents)         |
|    names, shapes, and where each tensor sits     |
+--------------------------------------------------+
|  TENSOR DATA (the weights, the big part)         |
|    [ billions of numbers, the model's brain ]    |
+--------------------------------------------------+

Here, we can see that the file starts with a small header. The header begins with the four letters GGUF, which is a marker that tells any program “this is a GGUF file”. After that comes the key-value metadata, which describes the model in plain labeled pairs. Then comes a small tensor info table, which is like a table of contents that lists every tensor’s name, its shape, and where it sits in the file. Finally comes the actual tensor data, the huge block of weights.
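
Let’s see how a program could check that marker and read the counts. This is a minimal sketch, not the code of any real tool. It assumes the header layout used by GGUF version 2 and later: the four-byte marker, a 32-bit version, then two 64-bit counts, all little-endian. To keep it self-contained, it parses a synthetic header built in memory, and the numbers in it are made up.

```python
import struct

# Header layout assumed here (GGUF version 2 and later, little-endian):
#   4 bytes  marker "GGUF"
#   uint32   version
#   uint64   tensor count
#   uint64   metadata key-value count
HEADER = struct.Struct("<4sIQQ")

def read_gguf_header(data):
    """Parse the fixed-size header from the first bytes of a file."""
    magic, version, n_tensors, n_kv = HEADER.unpack_from(data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_pairs": n_kv}

# Build a synthetic header in memory instead of opening a real model file.
# The version and counts below are made-up values for illustration.
fake_header = HEADER.pack(b"GGUF", 3, 291, 24)
print(read_gguf_header(fake_header))
```

With a real file, we would pass the first 24 bytes read from disk instead of `fake_header`. The counts tell the program how many metadata pairs and tensor info entries to read next.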

So, with one file, the program has the weights, the description, and the tokenizer, all in one place. The problem of scattered files is solved.

Now, there is still one big challenge. These models are huge. Let’s see how GGUF helps them fit on a normal laptop.

What is quantization

A model with billions of weights is very large. If each weight is stored as a very detailed, very exact number, the file becomes too big to fit in a laptop’s memory.

Let’s understand the size problem with a simple idea. Each number can be stored using a certain number of bits. A bit is the smallest unit of computer memory, a single 0 or 1. The more bits we use per number, the more precise the number is, but the more space it takes.

Suppose every weight is stored using 16 bits. A model with 7 billion weights would then need about 14 gigabytes just for the weights. That is too big for many laptops to handle comfortably.

So, here comes quantization to the rescue.

Quantization is the technique of storing each weight using fewer bits, so the model becomes much smaller.

In simple words, quantization means we round the model’s numbers to a simpler, shorter form that takes less space.

Let’s use a simple analogy. Suppose a price is 19.997 rupees. If we round it to 20 rupees, it is shorter and easier to store, and for most purposes it is close enough. We lost a tiny bit of exactness, but we saved space. Quantization does the same thing to the model’s weights. It stores them with fewer bits, so each number is a little less exact but takes much less room.

Let’s see the effect with numbers. If we drop from 16 bits per weight down to about 4 bits per weight, our 7 billion weight model shrinks from around 14 gigabytes to roughly 4 gigabytes. Now it fits on a normal laptop and can run on its CPU or GPU.
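
The arithmetic behind these sizes is simple enough to check in a few lines of Python. The function below is only an illustration: it multiplies the number of weights by the bits per weight and converts bits to gigabytes, with one gigabyte taken as one billion bytes.

```python
def weights_size_gb(num_weights, bits_per_weight):
    """Approximate size of the weights in gigabytes (1 GB = 10**9 bytes)."""
    total_bytes = num_weights * bits_per_weight / 8   # 8 bits in one byte
    return total_bytes / 1e9

seven_billion = 7_000_000_000
print(weights_size_gb(seven_billion, 16))  # 14.0
print(weights_size_gb(seven_billion, 4))   # 3.5
```

At exactly 4 bits the result is 3.5 gigabytes. A real file is somewhat larger, because quantized formats also store a little extra data, such as scaling values, next to the weights, which is why we say roughly 4 gigabytes.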

But, here is the catch. There is a trade-off.

**Fewer bits means smaller size and faster running, but slightly lower quality.** The answers can become a little less accurate, because we made the numbers less exact.
**More bits means larger size and slower running, but higher quality.** The answers stay closer to the original model.
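
We can see this trade-off with a small experiment. The sketch below uses a deliberately simple scheme: it rounds a handful of numbers to signed integers of a given bit width, with one shared scale. This is an assumption made for illustration and is not the exact method used inside GGUF files. The sample weights are made up.

```python
def quantize(values, bits):
    """Round values to signed integers of the given width, sharing one scale."""
    levels = 2 ** (bits - 1) - 1              # 7 for 4 bits, 127 for 8 bits
    # Fall back to a scale of 1.0 if every value is zero.
    scale = (max(abs(v) for v in values) / levels) or 1.0
    return [round(v / scale) for v in values], scale

def dequantize(ints, scale):
    """Recover approximate values from the stored integers."""
    return [i * scale for i in ints]

def max_error(values, bits):
    """Largest gap between an original value and its restored value."""
    ints, scale = quantize(values, bits)
    restored = dequantize(ints, scale)
    return max(abs(a - b) for a, b in zip(values, restored))

weights = [0.12, -0.53, 0.98, -0.07, 0.41, -0.88, 0.25, 0.66]
print("4-bit max error:", max_error(weights, 4))
print("8-bit max error:", max_error(weights, 8))
```

Running it shows a larger error at 4 bits than at 8 bits. The stored integers are smaller, but the restored numbers drift further from the originals.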

So, we choose the level of quantization based on our use case. If we have a small laptop and want speed, we pick a smaller form. If we have more memory and want the best quality, we pick a larger form.

GGUF supports many quantization levels, and it stores the quantized weights directly inside the file. This is one of the biggest reasons GGUF is so useful for local inference.

Quantization is one way to make a model smaller. Another is knowledge distillation, where a small model learns to copy a larger one. We have a detailed blog on how Knowledge Distillation works that explains this in depth.

Now, these quantization levels have names that look strange at first, like Q4_K_M. Let’s decode them.

Understanding quantization names like Q4_K_M

When we download a GGUF model, we will see names like Q4_K_M, Q5_K_M, and Q8_0. These look confusing, but they follow a simple pattern. Let’s break one down.

Take Q4_K_M as below:

Q4_K_M
 | | |
 | | +---  M  =  the size variant (S = small, M = medium, L = large)
 | +-----  K  =  a smarter, modern quantization method
 +-------  4  =  about 4 bits used per weight

Here, we can see that the name starts with Q and then has three parts. The Q simply means quantized. The number after it, the 4, tells us roughly how many bits are used per weight. So Q4 means about 4 bits per weight, and Q8 means about 8 bits per weight. The K means it uses a smarter, modern method that spends bits more wisely to keep quality high. The last letter, M, is a size variant, where S is small, M is medium, and L is large.

So, the simple rule to remember is this:

The number is the bits per weight. A bigger number means more bits, which means bigger file and better quality. A smaller number means fewer bits, which means smaller file and slightly lower quality.
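
The naming pattern can be written down as a small parser. This is an illustrative sketch that follows only the parts explained above, and it is not how any tool reads these names. The article does not explain the trailing `0` in `Q8_0`, so the sketch simply returns it as-is.

```python
import re

# Pattern for names like Q4_K_M or Q8_0, as described in this article.
QUANT_NAME = re.compile(
    r"^Q(?P<bits>\d+)_(?P<method>K|\d+)(?:_(?P<variant>[SML]))?$"
)

VARIANTS = {"S": "small", "M": "medium", "L": "large"}

def parse_quant_name(name):
    """Split a quantization name into bits, method, and size variant."""
    match = QUANT_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized quantization name: {name}")
    return {
        "bits": int(match["bits"]),                  # about this many bits per weight
        "method": match["method"],                   # "K" or a digit such as "0"
        "variant": VARIANTS.get(match["variant"]),   # None when there is no suffix
    }

print(parse_quant_name("Q4_K_M"))
print(parse_quant_name("Q8_0"))
```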

Let me tabulate the common quantization levels for your better understanding so that you can decide which one to use based on your use case.

| Name | Bits per weight | File size | Quality | Good for |
|---|---|---|---|---|
| Q4_K_M | about 4 | smallest of these | good | laptops with limited memory, a great balance |
| Q5_K_M | about 5 | medium | better | a bit more memory, slightly better answers |
| Q8_0 | about 8 | largest of these | best | plenty of memory, quality matters most |

Here, we can notice that Q4_K_M gives the smallest size with good quality, which is why it is one of the most popular choices. As we move to Q5_K_M and then Q8_0, the file gets bigger and the quality improves, but we need more memory to run it.

**Note:** If we are not sure which one to pick, Q4_K_M is a safe starting point for most laptops, because it balances size and quality very well. If our answers feel a little off and we have spare memory, we can move up to Q5_K_M or Q8_0.

So, now we know how to read these names and pick the right one based on our use case.

To master Quantization and Model Compression, check out the AI and Machine Learning Program by Outcome School.

How GGUF loads fast with memory mapping

We learned that a GGUF model can still be a few gigabytes. Now the question is, how does it start so fast without filling up our memory?

The answer is a technique calledmemory mapping, often written asmmap.

Let’s first understand the slow way. The simple way to load a file is to read the whole thing from the disk into memory before we use it. For a 4 gigabyte model, that means waiting until all 4 gigabytes are copied into memory. That is slow, and it uses a lot of memory right away.

So, here comes memory mapping to the rescue.

Memory mapping lets the program treat the file on disk as if it were already in memory, without copying the whole thing first.

In simple words, instead of loading everything up front, the program only reads the parts of the file it actually needs, exactly when it needs them.

Let’s use a simple analogy. Suppose we have a thick book. The slow way is to photocopy the entire book before reading a single page. Memory mapping is like keeping the book on the table and simply opening the exact page we need, only when we need it. We do not copy the whole book first. We just read pages on demand.

Let’s see the difference as below:

WITHOUT memory mapping:

  disk file (4 GB)  ===> copy all 4 GB into memory ===> then start
       slow start, uses a lot of memory immediately

WITH memory mapping (GGUF):

  disk file (4 GB)  --- start immediately --->
       the program reads only the needed parts, when needed
       fast start, memory used efficiently

Here, we can see that without memory mapping the program must copy all 4 gigabytes before it even starts, which is slow and heavy on memory. With memory mapping, the program starts right away and pulls in only the parts of the model it needs, when it needs them. This is why a GGUF model can start so quickly.

GGUF is designed to work well with memory mapping. Because the tensor info table records where each tensor sits, and the tensors are laid out in a clean, ordered way inside the file, the program can jump straight to any tensor it needs and read it directly. The model starts fast and uses memory efficiently.
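
Let’s try memory mapping on a tiny file using Python’s standard `mmap` module. This is a sketch of the general technique, not the loading code of any real tool. The file contents and the offset are made up: a real program would take the offset of each tensor from the tensor info table.

```python
import mmap
import os
import tempfile

# Create a small stand-in file; a real model file would be gigabytes.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"HEADER--" + b"A" * 4096 + b"TENSOR-B" + b"C" * 4096)
    path = f.name

def read_slice(path, offset, length):
    """Map the file and copy out only the requested byte range."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]

# Jump straight to the bytes that sit after the 8-byte header and 4096 bytes of "A".
chunk = read_slice(path, 8 + 4096, 8)
print(chunk)  # b'TENSOR-B'
os.remove(path)
```

Here, the program never reads the whole file into its own buffer. It asks for one byte range, and the operating system brings in the pages that hold that range.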

This is how GGUF gives us a fast startup on a normal machine.

If we want to go deep into LLM Inference Optimization, we have a complete program on it. Check out the AI and Machine Learning Program by Outcome School.

Why GGUF is cross-platform and extensible

There are two more qualities of GGUF that make it so widely used. Let’s understand both.

**First, GGUF is cross-platform.** Cross-platform means it works the same way across different operating systems and devices. The same GGUF file runs on Windows, macOS, and Linux, and on different kinds of processors. We do not need a different file for each system. We download one GGUF file and it just works wherever we run it.

**Second, GGUF is extensible.** Extensible means it is easy to add new information to the format without breaking older files. Remember, the metadata inside GGUF is stored as simple key-value pairs. So, when a new kind of model needs a new setting, the format simply adds a new key-value pair. Old programs can still read the file, and new programs can read the new key. This is exactly the weakness that the older GGML format had, and GGUF fixed it.
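
The idea of extending a format through key-value pairs can be sketched with plain Python dictionaries. Everything here is hypothetical: the key names, the `new_model_setting` key, and its default value are made up to show the pattern, and they are not real GGUF keys.

```python
# Metadata as key-value pairs. The keys below are illustrative examples.
old_file = {"architecture": "llama", "context_length": 4096}
new_file = {**old_file, "new_model_setting": 0.5}  # hypothetical added key

KNOWN_KEYS = ("architecture", "context_length")

def old_reader(metadata):
    """An older program: reads the keys it knows and ignores the rest."""
    return {key: metadata[key] for key in KNOWN_KEYS}

def new_reader(metadata):
    """A newer program: also reads the new key, with a default for old files."""
    settings = old_reader(metadata)
    settings["new_model_setting"] = metadata.get("new_model_setting", 1.0)
    return settings

print(old_reader(new_file))  # the old program still works on the new file
print(new_reader(old_file))  # the new program still works on the old file
```

Here, we can see that neither reader breaks. The old reader skips the key it does not know, and the new reader falls back to a default when the key is missing.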

So, GGUF is portable across machines and flexible enough to grow with new models. This is why it became the standard.

Now, let’s see where GGUF is actually used.

GGUF in the real world

GGUF is the format used by llama.cpp and the popular tools built on top of it.

Let’s understand llama.cpp first. It is an open-source program written to run large language models efficiently on normal computers, including laptops, using the CPU or the GPU. It is fast, lightweight, and it reads models in the GGUF format. GGUF was created as part of this llama.cpp ecosystem to be the clean, single-file format these models use.

Many friendly tools are built on top of llama.cpp, and they all use GGUF:

**Ollama.** This is a tool that lets us download and run models locally with a simple command. Under the hood, it uses GGUF files and llama.cpp.
**LM Studio.** This is a desktop app with a nice screen where we can search for models, download GGUF files, and chat with them, all on our own machine.

Let’s picture how these pieces stack together as below:

  +-----------------+        +-----------------+
  |     Ollama      |        |    LM Studio    |   the friendly tools we use
  +-----------------+        +-----------------+
            |                         |
            +------------+------------+
                         |
                +------------------+
                |    llama.cpp     |   runs the model efficiently
                +------------------+
                         |
                +------------------+
                |  .gguf file      |   weights + tokenizer + metadata
                +------------------+

Here, we can see that the GGUF file sits at the bottom holding everything the model needs. The llama.cpp program reads that file and runs the model. The friendly tools like Ollama and LM Studio sit on top of llama.cpp, giving us a simple way to use it. So they all rely on the same single GGUF file underneath.

In all of these, the flow is the same. We download a single file that ends with .gguf for the model and the quantization level we want, the tool opens it using memory mapping, reads the weights, the tokenizer, and the metadata from that one file, and we start chatting. There is nothing to assemble and no scattered files to manage.

So, anywhere we want to run a large language model on our own machine, GGUF is very likely the format we will use.

The models we run locally like this are often Small Language Models, compact enough to fit on a laptop or phone. We have a detailed blog on Small Language Models (SLMs) that explains why smaller models matter.

This is how GGUF works. It packs the model’s weights, its tokenizer, and its metadata into one self-contained file, it uses quantization to shrink the weights so big models fit on a normal laptop, and it loads quickly with memory mapping, which is why it became the standard format for running large language models locally.

Prepare yourself for the AI Engineering Interview: AI Engineering Interview Questions

That’s it for now.

Thanks

Amit Shekhar Founder @Outcome School

