MiMo-V2.5-coder

Reddit r/LocalLLaMA 05/25/26, 08:39 AM Models

quantization coding tool-calling gguf apple-silicon llama.cpp open-source

Summary

A quantized GGUF build of Xiaomi's MiMo-V2.5 model, tuned for coding and tool-calling on 128GB Apple Silicon systems, prioritizing reliable tool calls and code generation.

Hi, I've just released MiMo-V2.5-coder. If you have 128 Gb, this is an excellent alternative to Qwen3.6 and DS4, especially for coding. Fast, and with reliable tool calling. Give it a try!

Original Article

View Cached Full Text

Cached at: 05/25/26, 10:17 AM

jedisct1/MiMo-V2.5-coder-Q2 · Hugging Face

Source: https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2

https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#mimo-v25-coder-q2-ggufMiMo-V2.5 Coder Q2 GGUF

This is a local, self-quantized GGUF build ofXiaomiMiMo/MiMo-V2.5, tuned for coding and tool-calling on a 128 GB Apple Silicon M5 machine.

This quant was optimized for systems with 128 GB of memory. The default serving profile targets a 128 GB Apple Silicon machine and tries to keep the model practical at a 100,000-token context. Smaller-memory systems will likely need more aggressive CPU offload, a smaller context, or a different quant.

It is a text-only quantization. The original MiMo-V2.5 checkpoint is omnimodal, but this GGUF does not include the vision or audio encoders. The MiMo multi-token prediction blocks were also omitted during conversion because normal llama.cpp generation does not currently execute them for this model.

https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#quantizationQuantization

High-level summary:

Quant type:Q2\_K\_S
Importance matrix: coding and tool-calling focused
Preserved higher precision for embeddings, output, attention, and the dense first FFN
MoE down-expert tensors:Q3\_K
Reported quantized size: about 108,496.76 MiB at 2.95 BPW

One tokenizer metadata fix is included so llama.cpp does not warn about the base-vocab</s\>token at load time. MiMo’s actual EOS token remains<\|im\_end\|\>.

https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#imatrixImatrix

This build deliberately prioritizes:

reliable OpenAI-compatible tool calls
coding and shell-oriented agent use
English prompts and codebase work
practical inference on a 128 GB Apple Silicon system

Chinese-language quality and multimodal use were not optimization targets.

https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#servingServing

llama-server \
  -hf jedisct1/MiMo-V2.5-coder-Q2 \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 100000 \
  --parallel 1 \
  --batch-size 512 \
  --ubatch-size 128 \
  --threads 12 \
  --threads-batch 18 \
  --prio 0 \
  --poll 80 \
  --flash-attn on \
  --jinja \
  --fit on \
  --fit-target 4096 \
  --fit-ctx 100000 \
  --gpu-layers auto \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --reasoning off

This starts an OpenAI-compatible server on127\.0\.0\.1:8080. The repository contains one GGUF split set, so recent llama.cpp builds should select the first shard automatically.

If you cloned or downloaded the repository locally, you can use the included helper script instead:

./run-server.sh

The script uses the same defaults and loads the first GGUF shard from the repository directory.

Default serving settings:

MIMO_CTX=100000
MIMO_FIT_CTX=100000
MIMO_FIT_TARGET=4096
MIMO_BATCH=512
MIMO_UBATCH=128
MIMO_REASONING=off
MIMO_CPU_MOE=0

These defaults are tuned for an Apple M5 Max with 128 GB unified memory. They keep reasoning output disabled, use the model’s Jinja chat template, use Flash Attention, and ask llama.cpp to fit as much of the model as possible onto Metal.

If you hit memory pressure, use the safer CPU-MoE mode:

MIMO_CPU_MOE=1 MIMO_FIT_TARGET=32768 MIMO_BATCH=128 MIMO_UBATCH=64 ./run-server.sh

That mode is slower, especially on long prompt prefill, but it leaves much more Metal memory headroom.

You can point directly at a differentllama\-serverbinary with:

LLAMA_SERVER=/path/to/llama-server ./run-server.sh

You can also runllama\-serverdirectly against local files without the helper script:

llama-server \
  --model MiMo-V2.5-coder-Q2-00001-of-00016.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 100000 \
  --parallel 1 \
  --batch-size 512 \
  --ubatch-size 128 \
  --threads 12 \
  --threads-batch 18 \
  --prio 0 \
  --poll 80 \
  --flash-attn on \
  --jinja \
  --fit on \
  --fit-target 4096 \
  --fit-ctx 100000 \
  --gpu-layers auto \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --reasoning off

For the safer CPU-MoE fallback, add\-\-cpu\-moeand use a larger fit margin:

llama-server \
  --model MiMo-V2.5-coder-Q2-00001-of-00016.gguf \
  --ctx-size 100000 \
  --fit on \
  --fit-target 32768 \
  --fit-ctx 100000 \
  --batch-size 128 \
  --ubatch-size 64 \
  --flash-attn on \
  --jinja \
  --gpu-layers auto \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --reasoning off \
  --cpu-moe

https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#tool-calling-notesTool-Calling Notes

For best tool-calling results:

Use theSwivalharness - it should work with anything using OpenAI-like tool calling convention, but it is tested with Swival.
Disable model reasoning output with\-\-reasoning offorMIMO\_REASONING=off.
Setparallel\_tool\_callstofalseif your client supports it.
Avoid forcingtool\_choice: required; in testing it made malformed calls more likely.

https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#licenseLicense

The upstream model card forXiaomiMiMo/MiMo\-V2\.5declares the MIT license. This derived GGUF is provided under the same license metadata.

MiMo-V2.5-coder

jedisct1/MiMo-V2.5-coder-Q2 · Hugging Face

https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#mimo-v25-coder-q2-ggufMiMo-V2.5 Coder Q2 GGUF

https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#quantizationQuantization

https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#imatrixImatrix

https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#servingServing

https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#tool-calling-notesTool-Calling Notes

https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#licenseLicense

Similar Articles

XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash

Tested Xiaomi's MiMo V2.5 Pro for autonomous coding: 301 commits, 60+ pages, $70 in API costs. Now it's open-source.

XiaomiMiMo/MiMo-V2.5-Pro

Jiunsong/supergemma4-26b-uncensored-mlx-4bit-v2

@port_dev: https://x.com/port_dev/status/2054259445732110408

Submit Feedback

Similar Articles

XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash

Tested Xiaomi's MiMo V2.5 Pro for autonomous coding: 301 commits, 60+ pages, $70 in API costs. Now it's open-source.

Jiunsong/supergemma4-26b-uncensored-mlx-4bit-v2

@port_dev: https://x.com/port_dev/status/2054259445732110408