MLX 16/8/4/2-bit quants of nvidia/llama-embed-nemotron-8b

Reddit r/LocalLLaMA 05/14/26, 05:52 PM Models

mlx quantization embedding llama nemotron apple-silicon open-source

Summary

The author converted NVIDIA's llama-embed-nemotron-8b embedding model to MLX format in fp16, 8-bit, 4-bit, and 2-bit variants, so it can be loaded in-process on Apple Silicon via mlx-embeddings without a separate embedding server.

I converted nvidia/llama-embed-nemotron-8b to MLX fp16, 8-bit, 4-bit, and 2-bit (for my OCD) and put it on HuggingFace:

[ncorder/llama-embed-nemotron-8b-mlx-fp16](https://huggingface.co/ncorder/llama-embed-nemotron-8b-mlx-fp16)

[ncorder/llama-embed-nemotron-8b-mlx-8bit](https://huggingface.co/ncorder/llama-embed-nemotron-8b-mlx-8bit)

[ncorder/llama-embed-nemotron-8b-mlx-4bit](https://huggingface.co/ncorder/llama-embed-nemotron-8b-mlx-4bit)

[ncorder/llama-embed-nemotron-8b-mlx-2bit](https://huggingface.co/ncorder/llama-embed-nemotron-8b-mlx-2bit)

I was running this model using GGUFs + llama-server for local semantic search over an Obsidian vault and some other projects. It worked fine, but I got tired of managing a whole HTTP server just for embeddings and also wanted Apple Silicon optimizations. The MLX version loads in-process via mlx-embeddings, no server.

    from mlx_embeddings import load_model, encode

    model, tokenizer = load_model("ncorder/llama-embed-nemotron-8b-mlx-4bit")
    embeddings = encode(model, tokenizer, ["your text here"])

Enjoy!
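For the local semantic search use case mentioned in the post, the ranking step on top of the embeddings might look roughly like the sketch below. It is not from the post: it assumes the embeddings have already been converted to plain Python lists of floats, and the helper names `cosine_similarity` and `rank_notes` are hypothetical. The toy 3-dimensional vectors only stand in for real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat a zero vector as having no similarity to anything.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_notes(query_vec, note_vecs, top_k=3):
    """Return up to top_k (note_index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(note_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings (hypothetical data).
notes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [1.0, 0.1, 0.0]

for index, score in rank_notes(query, notes, top_k=2):
    print(index, round(score, 3))
```

In a real setup the note vectors would be computed once and cached, so that only the query needs to be embedded at search time.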


Similar Articles

Qwen3.6-35B-A3B-Abliterated-Heretic-MLX-4bit

Reddit r/LocalLLaMA

The user reviews a quantized and fine-tuned version of the Qwen3.6-35B model optimized for Apple Silicon via MLX, praising its speed, intelligence, and lack of safety disclaimers.

We added W8A8 activation quantization to MLX — prefill went from 2.84s to 2.52s on M5 Pro

Reddit r/LocalLLaMA

Mininglamp AI released Cider, a small SDK that adds W8A8 activation quantization to Apple's MLX framework, achieving up to 1.84x speedup on prefill for large language models on M5 Pro via custom Metal kernels. The tool works with any MLX model, with INT8 TensorOps support for M5 and above.

Gemma4 26b MoE running in MLX with turboquant (and custom kernel)

Reddit r/LocalLLaMA

A developer successfully ran Gemma4 26b MoE on Apple MacBook Air M5 using MLX with turboquant and a custom kernel, achieving faster prompt processing and generation speeds than llama.cpp with lower memory usage. The implementation includes instructions for local deployment.

@neural_avb: I am working on porting SAM models and harness into Apple silicon. Already seeing 1.25x inference speed increase on mlx…

X AI KOLs Following

Porting SAM 2.1 models to Apple silicon with MLX, achieving 1.25x inference speed increase on the small model, with quantized versions planned.

@DivyanshT91162: Local LLMs just hit a whole new level This Hugging Face release is actually insane: "gpt-oss-20b-tq3" An official 20B+ …

X AI KOLs Timeline

A new 20B+ parameter MoE model from OpenAI, quantized to 3-bit via TurboQuant and optimized with MLX, allows for high-performance local LLM inference on standard 16GB MacBooks.
