MiMo-V2.5-coder

Reddit r/LocalLLaMA Models

Summary

A quantized GGUF build of Xiaomi's MiMo-V2.5 model, tuned for coding and tool-calling on 128GB Apple Silicon systems, prioritizing reliable tool calls and code generation.

Hi, I've just released MiMo-V2.5-coder. If you have 128 Gb, this is an excellent alternative to Qwen3.6 and DS4, especially for coding. Fast, and with reliable tool calling. Give it a try!
Original Article
View Cached Full Text

Cached at: 05/25/26, 10:17 AM

jedisct1/MiMo-V2.5-coder-Q2 · Hugging Face

Source: https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2

https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#mimo-v25-coder-q2-ggufMiMo-V2.5 Coder Q2 GGUF

This is a local, self-quantized GGUF build ofXiaomiMiMo/MiMo-V2.5, tuned for coding and tool-calling on a 128 GB Apple Silicon M5 machine.

This quant was optimized for systems with 128 GB of memory. The default serving profile targets a 128 GB Apple Silicon machine and tries to keep the model practical at a 100,000-token context. Smaller-memory systems will likely need more aggressive CPU offload, a smaller context, or a different quant.

It is a text-only quantization. The original MiMo-V2.5 checkpoint is omnimodal, but this GGUF does not include the vision or audio encoders. The MiMo multi-token prediction blocks were also omitted during conversion because normal llama.cpp generation does not currently execute them for this model.

https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#quantizationQuantization

High-level summary:

  • Quant type:Q2\_K\_S
  • Importance matrix: coding and tool-calling focused
  • Preserved higher precision for embeddings, output, attention, and the dense first FFN
  • MoE down-expert tensors:Q3\_K
  • Reported quantized size: about 108,496.76 MiB at 2.95 BPW

One tokenizer metadata fix is included so llama.cpp does not warn about the base-vocab</s\>token at load time. MiMo’s actual EOS token remains<\|im\_end\|\>.

https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#imatrixImatrix

This build deliberately prioritizes:

  • reliable OpenAI-compatible tool calls
  • coding and shell-oriented agent use
  • English prompts and codebase work
  • practical inference on a 128 GB Apple Silicon system

Chinese-language quality and multimodal use were not optimization targets.

https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#servingServing

llama-server \
  -hf jedisct1/MiMo-V2.5-coder-Q2 \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 100000 \
  --parallel 1 \
  --batch-size 512 \
  --ubatch-size 128 \
  --threads 12 \
  --threads-batch 18 \
  --prio 0 \
  --poll 80 \
  --flash-attn on \
  --jinja \
  --fit on \
  --fit-target 4096 \
  --fit-ctx 100000 \
  --gpu-layers auto \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --reasoning off

This starts an OpenAI-compatible server on127\.0\.0\.1:8080. The repository contains one GGUF split set, so recent llama.cpp builds should select the first shard automatically.

If you cloned or downloaded the repository locally, you can use the included helper script instead:

./run-server.sh

The script uses the same defaults and loads the first GGUF shard from the repository directory.

Default serving settings:

MIMO_CTX=100000
MIMO_FIT_CTX=100000
MIMO_FIT_TARGET=4096
MIMO_BATCH=512
MIMO_UBATCH=128
MIMO_REASONING=off
MIMO_CPU_MOE=0

These defaults are tuned for an Apple M5 Max with 128 GB unified memory. They keep reasoning output disabled, use the model’s Jinja chat template, use Flash Attention, and ask llama.cpp to fit as much of the model as possible onto Metal.

If you hit memory pressure, use the safer CPU-MoE mode:

MIMO_CPU_MOE=1 MIMO_FIT_TARGET=32768 MIMO_BATCH=128 MIMO_UBATCH=64 ./run-server.sh

That mode is slower, especially on long prompt prefill, but it leaves much more Metal memory headroom.

You can point directly at a differentllama\-serverbinary with:

LLAMA_SERVER=/path/to/llama-server ./run-server.sh

You can also runllama\-serverdirectly against local files without the helper script:

llama-server \
  --model MiMo-V2.5-coder-Q2-00001-of-00016.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 100000 \
  --parallel 1 \
  --batch-size 512 \
  --ubatch-size 128 \
  --threads 12 \
  --threads-batch 18 \
  --prio 0 \
  --poll 80 \
  --flash-attn on \
  --jinja \
  --fit on \
  --fit-target 4096 \
  --fit-ctx 100000 \
  --gpu-layers auto \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --reasoning off

For the safer CPU-MoE fallback, add\-\-cpu\-moeand use a larger fit margin:

llama-server \
  --model MiMo-V2.5-coder-Q2-00001-of-00016.gguf \
  --ctx-size 100000 \
  --fit on \
  --fit-target 32768 \
  --fit-ctx 100000 \
  --batch-size 128 \
  --ubatch-size 64 \
  --flash-attn on \
  --jinja \
  --gpu-layers auto \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --reasoning off \
  --cpu-moe

https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#tool-calling-notesTool-Calling Notes

For best tool-calling results:

  • Use theSwivalharness - it should work with anything using OpenAI-like tool calling convention, but it is tested with Swival.
  • Disable model reasoning output with\-\-reasoning offorMIMO\_REASONING=off.
  • Setparallel\_tool\_callstofalseif your client supports it.
  • Avoid forcingtool\_choice: required; in testing it made malformed calls more likely.

https://huggingface.co/jedisct1/MiMo-V2.5-coder-Q2#licenseLicense

The upstream model card forXiaomiMiMo/MiMo\-V2\.5declares the MIT license. This derived GGUF is provided under the same license metadata.

Similar Articles

XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash

Hugging Face Models Trending

XiaomiMiMo releases MiMo-V2.5-Pro-FP4-DFlash, an FP4-quantized MoE model with block-diffusion speculative decoding to reduce memory and bandwidth for trillion-parameter inference.

XiaomiMiMo/MiMo-V2.5-Pro

Hugging Face Models Trending

Xiaomi releases MiMo-V2.5-Pro, an open-source MoE language model with 1.02T total parameters and 1M token context, optimized for complex agentic and software engineering tasks.

Jiunsong/supergemma4-26b-uncensored-mlx-4bit-v2

Hugging Face Models Trending

SuperGemma4-26B-Uncensored-MLX-4bit-v2 is a fine-tuned and quantized variant of Google's Gemma 4 26B optimized for Apple Silicon, offering improved performance on code, reasoning, and tool-use tasks while maintaining faster inference speeds compared to the stock baseline.

@port_dev: https://x.com/port_dev/status/2054259445732110408

X AI KOLs Timeline

The article provides a detailed tutorial on setting up a local coding agent using Qwen3.6-27B via Unsloth Studio and the Pi coding harness. It highlights the benefits of using GGUF quantized models for efficient inference on consumer hardware like Apple Silicon Macs.