MiMo-V2.5-coder
Summary
A quantized GGUF build of Xiaomi's MiMo-V2.5 model, tuned for coding and tool-calling on 128GB Apple Silicon systems, prioritizing reliable tool calls and code generation.
View Cached Full Text
Cached at: 05/25/26, 10:17 AM
jedisct1/MiMo-V2.5-coder-Q2 · Hugging Face
Source: https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2
https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#mimo-v25-coder-q2-ggufMiMo-V2.5 Coder Q2 GGUF
This is a local, self-quantized GGUF build ofXiaomiMiMo/MiMo-V2.5, tuned for coding and tool-calling on a 128 GB Apple Silicon M5 machine.
This quant was optimized for systems with 128 GB of memory. The default serving profile targets a 128 GB Apple Silicon machine and tries to keep the model practical at a 100,000-token context. Smaller-memory systems will likely need more aggressive CPU offload, a smaller context, or a different quant.
It is a text-only quantization. The original MiMo-V2.5 checkpoint is omnimodal, but this GGUF does not include the vision or audio encoders. The MiMo multi-token prediction blocks were also omitted during conversion because normal llama.cpp generation does not currently execute them for this model.
https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#quantizationQuantization
High-level summary:
- Quant type:
Q2\_K\_S - Importance matrix: coding and tool-calling focused
- Preserved higher precision for embeddings, output, attention, and the dense first FFN
- MoE down-expert tensors:
Q3\_K - Reported quantized size: about 108,496.76 MiB at 2.95 BPW
One tokenizer metadata fix is included so llama.cpp does not warn about the base-vocab</s\>token at load time. MiMo’s actual EOS token remains<\|im\_end\|\>.
https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#imatrixImatrix
This build deliberately prioritizes:
- reliable OpenAI-compatible tool calls
- coding and shell-oriented agent use
- English prompts and codebase work
- practical inference on a 128 GB Apple Silicon system
Chinese-language quality and multimodal use were not optimization targets.
https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#servingServing
llama-server \
-hf jedisct1/MiMo-V2.5-coder-Q2 \
--host 127.0.0.1 \
--port 8080 \
--ctx-size 100000 \
--parallel 1 \
--batch-size 512 \
--ubatch-size 128 \
--threads 12 \
--threads-batch 18 \
--prio 0 \
--poll 80 \
--flash-attn on \
--jinja \
--fit on \
--fit-target 4096 \
--fit-ctx 100000 \
--gpu-layers auto \
--cache-type-k f16 \
--cache-type-v f16 \
--reasoning off
This starts an OpenAI-compatible server on127\.0\.0\.1:8080. The repository contains one GGUF split set, so recent llama.cpp builds should select the first shard automatically.
If you cloned or downloaded the repository locally, you can use the included helper script instead:
./run-server.sh
The script uses the same defaults and loads the first GGUF shard from the repository directory.
Default serving settings:
MIMO_CTX=100000
MIMO_FIT_CTX=100000
MIMO_FIT_TARGET=4096
MIMO_BATCH=512
MIMO_UBATCH=128
MIMO_REASONING=off
MIMO_CPU_MOE=0
These defaults are tuned for an Apple M5 Max with 128 GB unified memory. They keep reasoning output disabled, use the model’s Jinja chat template, use Flash Attention, and ask llama.cpp to fit as much of the model as possible onto Metal.
If you hit memory pressure, use the safer CPU-MoE mode:
MIMO_CPU_MOE=1 MIMO_FIT_TARGET=32768 MIMO_BATCH=128 MIMO_UBATCH=64 ./run-server.sh
That mode is slower, especially on long prompt prefill, but it leaves much more Metal memory headroom.
You can point directly at a differentllama\-serverbinary with:
LLAMA_SERVER=/path/to/llama-server ./run-server.sh
You can also runllama\-serverdirectly against local files without the helper script:
llama-server \
--model MiMo-V2.5-coder-Q2-00001-of-00016.gguf \
--host 127.0.0.1 \
--port 8080 \
--ctx-size 100000 \
--parallel 1 \
--batch-size 512 \
--ubatch-size 128 \
--threads 12 \
--threads-batch 18 \
--prio 0 \
--poll 80 \
--flash-attn on \
--jinja \
--fit on \
--fit-target 4096 \
--fit-ctx 100000 \
--gpu-layers auto \
--cache-type-k f16 \
--cache-type-v f16 \
--reasoning off
For the safer CPU-MoE fallback, add\-\-cpu\-moeand use a larger fit margin:
llama-server \
--model MiMo-V2.5-coder-Q2-00001-of-00016.gguf \
--ctx-size 100000 \
--fit on \
--fit-target 32768 \
--fit-ctx 100000 \
--batch-size 128 \
--ubatch-size 64 \
--flash-attn on \
--jinja \
--gpu-layers auto \
--cache-type-k f16 \
--cache-type-v f16 \
--reasoning off \
--cpu-moe
https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#tool-calling-notesTool-Calling Notes
For best tool-calling results:
- Use theSwivalharness - it should work with anything using OpenAI-like tool calling convention, but it is tested with Swival.
- Disable model reasoning output with
\-\-reasoning offorMIMO\_REASONING=off. - Set
parallel\_tool\_callstofalseif your client supports it. - Avoid forcing
tool\_choice: required; in testing it made malformed calls more likely.
https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#licenseLicense
The upstream model card forXiaomiMiMo/MiMo\-V2\.5declares the MIT license. This derived GGUF is provided under the same license metadata.
Similar Articles
XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash
XiaomiMiMo releases MiMo-V2.5-Pro-FP4-DFlash, an FP4-quantized MoE model with block-diffusion speculative decoding to reduce memory and bandwidth for trillion-parameter inference.
Tested Xiaomi's MiMo V2.5 Pro for autonomous coding: 301 commits, 60+ pages, $70 in API costs. Now it's open-source.
Xiaomi has open-sourced its MiMo V2.5 Pro model, a 1.02T parameter MoE model designed for autonomous coding tasks. The article details a real-world test showing high efficiency with low API costs due to high cache hit rates.
XiaomiMiMo/MiMo-V2.5-Pro
Xiaomi releases MiMo-V2.5-Pro, an open-source MoE language model with 1.02T total parameters and 1M token context, optimized for complex agentic and software engineering tasks.
Jiunsong/supergemma4-26b-uncensored-mlx-4bit-v2
SuperGemma4-26B-Uncensored-MLX-4bit-v2 is a fine-tuned and quantized variant of Google's Gemma 4 26B optimized for Apple Silicon, offering improved performance on code, reasoning, and tool-use tasks while maintaining faster inference speeds compared to the stock baseline.
@port_dev: https://x.com/port_dev/status/2054259445732110408
The article provides a detailed tutorial on setting up a local coding agent using Qwen3.6-27B via Unsloth Studio and the Pi coding harness. It highlights the benefits of using GGUF quantized models for efficient inference on consumer hardware like Apple Silicon Macs.