MiniMax-M3-EAGLE3-GGUF - llama.cpp-compatible MiniMax M3 EAGLE draft model!

Reddit r/LocalLLaMA Tools

Summary

A GGUF conversion of MiniMax M3's EAGLE draft model for llama.cpp is now available, enabling speculative decoding speedups on compatible hardware.

Hi all! With a new PR for llama.cpp, the MiniMax M3 EAGLE draft model (Inferact/MiniMax-M3-EAGLE3) has been converted to GGUF and runs without issue in my testing. The HF repo has instructions for both merging the PR and running the model. I tested it on a system with 2x RTX 3090 and 128GB of DDR4, running the UD-Q2_K_XL quant, and generation went from 2.3 tok/s to 5 tok/s (roughly 2.2x), thanks to --fit and making sure the draft model sat in VRAM rather than system RAM. The GGUF can be found here: https://huggingface.co/tonjum/MiniMax-M3-EAGLE3-GGUF
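For anyone wondering why a small draft model helps at all, here is a minimal toy sketch of a greedy draft-then-verify step in Python. It is a generic illustration of speculative decoding, not EAGLE3 itself and not how llama.cpp implements it. The names `speculative_step`, `toy_target`, and `toy_draft` are made up for this example, and the "models" are simple functions over integer tokens.

```python
from typing import Callable, List

# A "model" here is just a function: token context -> next token (greedy).
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-then-verify step and return the tokens accepted."""
    # 1. The cheap draft model proposes k tokens one after another.
    proposed: List[int] = []
    ctx = list(context)
    for _ in range(k):
        token = draft(ctx)
        proposed.append(token)
        ctx.append(token)

    # 2. The target model checks each proposal. A real engine scores all
    #    k positions in a single batched pass; this loop is for clarity.
    accepted: List[int] = []
    ctx = list(context)
    for token in proposed:
        expected = target(ctx)
        if expected != token:
            # First mismatch: keep the target's token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. Every proposal matched, so the target adds one extra token.
    accepted.append(target(ctx))
    return accepted


def toy_target(ctx: List[int]) -> int:
    # Hypothetical "big" model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Hypothetical "small" model: agrees with the target except after a 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step(toy_target, toy_draft, [0], k=4))  # all drafts accepted
    print(speculative_step(toy_target, toy_draft, [3], k=4))  # rejected after one
```

The output always matches what the target model would have produced on its own, so the gain comes purely from accepting several tokens per target pass. That is also why the draft model has to be fast: if it sits in slow memory, drafting can cost more than it saves.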
