MLX 16/8/4/2-bit quantized versions of nvidia/llama-embed-nemotron-8b

Reddit r/LocalLLaMA 2026/05/14 17:52 Models

mlx quantization embedding llama nemotron apple-silicon open-source

Summary

A user converted Nvidia's Llama-Embed-Nemotron-8B model to MLX format in fp16, 8-bit, 4-bit, and 2-bit variants, so the embedding model can be loaded in-process on Apple Silicon via mlx-embeddings.
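As a rough guide to what the four variants cost in memory, here is a back-of-the-envelope sketch. It assumes about 8 billion parameters (taken from the "8b" in the model name, not a measured count) and MLX-style group quantization with a group size of 64, where each group also stores an fp16 scale and an fp16 bias. The group size actually used for these uploads is not stated in the post, and layers left unquantized are ignored, so real file sizes will differ.

```python
def estimate_weight_gb(n_params, bits, group_size=None):
    """Rough weight size in GB (10^9 bytes).

    group_size=None means unquantized weights (e.g. fp16).
    Otherwise each group of `group_size` weights is assumed to carry
    one fp16 scale and one fp16 bias, i.e. 32 extra bits per group.
    """
    overhead_bits = 0.0 if group_size is None else 32.0 / group_size
    return n_params * (bits + overhead_bits) / 8 / 1e9


N_PARAMS = 8e9  # assumption: read off the "8b" in the model name

# (label, bits per weight, assumed group size)
variants = [("fp16", 16, None), ("8bit", 8, 64), ("4bit", 4, 64), ("2bit", 2, 64)]

for label, bits, group_size in variants:
    size_gb = estimate_weight_gb(N_PARAMS, bits, group_size)
    print(f"{label}: ~{size_gb:.1f} GB")
```

Under these assumptions the per-group scale and bias add half a bit per weight, which matters most, proportionally, for the 2-bit variant.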

I converted nvidia/llama-embed-nemotron-8b to MLX in fp16, 8-bit, 4-bit, and 2-bit variants (the full set mostly to satisfy my completionist streak) and uploaded them to HuggingFace:

- [ncorder/llama-embed-nemotron-8b-mlx-fp16](https://huggingface.co/ncorder/llama-embed-nemotron-8b-mlx-fp16)
- [ncorder/llama-embed-nemotron-8b-mlx-8bit](https://huggingface.co/ncorder/llama-embed-nemotron-8b-mlx-8bit)
- [ncorder/llama-embed-nemotron-8b-mlx-4bit](https://huggingface.co/ncorder/llama-embed-nemotron-8b-mlx-4bit)
- [ncorder/llama-embed-nemotron-8b-mlx-2bit](https://huggingface.co/ncorder/llama-embed-nemotron-8b-mlx-2bit)

I had been running this model with GGUFs and llama-server for local semantic search over my Obsidian vault and other projects. It worked well, but I got tired of managing an entire HTTP server just for embeddings, and I also wanted the Apple Silicon optimizations. The MLX versions load in-process via mlx-embeddings, with no server needed.

    from mlx_embeddings import load_model, encode

    model, tokenizer = load_model("ncorder/llama-embed-nemotron-8b-mlx-4bit")
    embeddings = encode(model, tokenizer, ["your text here"])

Enjoy!
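For anyone wondering what the semantic search step looks like once the vectors are in hand, here is a minimal sketch of ranking notes by cosine similarity. It is not the author's pipeline. It assumes the embeddings have already been converted to plain Python lists of floats, and the note names, the toy 3-dimensional vectors, and the helper names `cosine_similarity` and `rank_notes` are all made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_notes(query_vec, note_vecs, top_k=3):
    """Return the top_k (name, score) pairs, best match first."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in note_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings (hypothetical data).
notes = {
    "recipes.md": [0.9, 0.1, 0.0],
    "mlx-setup.md": [0.1, 0.9, 0.2],
    "travel.md": [0.0, 0.2, 0.9],
}
query = [0.2, 0.8, 0.1]

results = rank_notes(query, notes, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

With a real vault you would embed each note once, cache the vectors, and embed only the query at search time. For a few thousand notes a linear scan like this is usually fast enough that no vector database is needed.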
