MLX 16/8/4/2-bit quants of nvidia/llama-embed-nemotron-8b

Reddit r/LocalLLaMA Models

Summary

The user converted Nvidia's Llama-Embed-Nemotron-8B model to MLX format with fp16, 8-bit, 4-bit, and 2-bit quantizations, enabling in-process embedding loading on Apple Silicon via mlx-embeddings.

I converted nvidia/llama-embed-nemotron-8b to MLX fp16, 8-bit, 4-bit, and 2-bit (for my OCD) and put it on HuggingFace:

- [ncorder/llama-embed-nemotron-8b-mlx-fp16](https://huggingface.co/ncorder/llama-embed-nemotron-8b-mlx-fp16)
- [ncorder/llama-embed-nemotron-8b-mlx-8bit](https://huggingface.co/ncorder/llama-embed-nemotron-8b-mlx-8bit)
- [ncorder/llama-embed-nemotron-8b-mlx-4bit](https://huggingface.co/ncorder/llama-embed-nemotron-8b-mlx-4bit)
- [ncorder/llama-embed-nemotron-8b-mlx-2bit](https://huggingface.co/ncorder/llama-embed-nemotron-8b-mlx-2bit)

I was running this model using GGUFs + llama-server for local semantic search over an Obsidian vault and some other projects. It worked fine, but I got tired of managing a whole HTTP server just for embeddings and also wanted Apple Silicon optimizations. The MLX version loads in-process via mlx-embeddings, no server.

```python
from mlx_embeddings import load_model, encode

model, tokenizer = load_model("ncorder/llama-embed-nemotron-8b-mlx-4bit")
embeddings = encode(model, tokenizer, ["your text here"])
```

Enjoy!
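For the semantic search use case mentioned in the post, the ranking step on top of the embeddings might look roughly like the sketch below. It is not taken from the post: it assumes the embeddings have already been converted to plain Python lists of floats, and the note names, the toy 3-dimensional vectors, and the `rank_notes` helper are all hypothetical stand-ins for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Treat a zero vector as unrelated to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_notes(query_vec, note_vecs, top_k=3):
    """Return the top_k (note_name, score) pairs, most similar first.

    Hypothetical helper: note_vecs maps a note name to its embedding.
    """
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in note_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings (assumption: real ones
# would come from the embedding model and have far more dimensions).
note_vecs = {
    "cooking.md": [0.9, 0.1, 0.0],
    "mlx-notes.md": [0.1, 0.9, 0.2],
    "travel.md": [0.0, 0.2, 0.9],
}
query_vec = [0.2, 0.8, 0.1]

results = rank_notes(query_vec, note_vecs, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

For a vault of a few thousand notes a linear scan like this is usually fast enough; a vector index only starts to matter at much larger scales.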