@0xSero: Locally Part 1 - Apple Silicon Macs give you large pools of memory to run big models, but the token generation speed wi…
Summary
Apple Silicon Macs offer large pools of memory for running big models, but token generation is slower than most users are used to. They perform best with large mixture-of-experts (MoE) models that have a low active parameter count.
Cached at: 04/22/26, 03:00 PM
Locally Part 1 - Apple Silicon Macs give you large pools of memory to run big models, but the token generation speed will be lower than most are used to. Macs are best with large MoEs that have low ACTIVE params. Basically when you see a model like Qwen3.5-397B-A17B this
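The point about active parameters can be made concrete with rough arithmetic. The sketch below assumes that a name like Qwen3.5-397B-A17B means 397B total and 17B active parameters, that decoding is limited by memory bandwidth, and that every active weight is read once per token. The 4-bit quantization and the 800 GB/s bandwidth are illustrative placeholders, not measured or quoted figures, and the function name is hypothetical.

```python
def estimate_decode(total_params_b, active_params_b, bits_per_weight, bandwidth_gb_s):
    """Rough upper bound on decode speed for a bandwidth-bound model.

    total_params_b / active_params_b are in billions of parameters.
    Returns (weights_gb, tokens_per_s).
    """
    bytes_per_param = bits_per_weight / 8
    # Memory needed just to hold all weights (ignores KV cache and overhead).
    weights_gb = total_params_b * bytes_per_param
    # Bytes that must be read from memory to produce one token.
    active_gb = active_params_b * bytes_per_param
    # Assumption: each active weight is read exactly once per token.
    tokens_per_s = bandwidth_gb_s / active_gb
    return weights_gb, tokens_per_s


# Assumed values: 4-bit weights, 800 GB/s memory bandwidth (placeholders).
moe_size, moe_speed = estimate_decode(397, 17, 4, 800)
# Same total size, but treated as dense: every weight is active per token.
dense_size, dense_speed = estimate_decode(397, 397, 4, 800)

print(f"MoE:   {moe_size:.1f} GB of weights, up to ~{moe_speed:.0f} tokens/s")
print(f"Dense: {dense_size:.1f} GB of weights, up to ~{dense_speed:.0f} tokens/s")
```

Under these assumptions both models need the same memory, but the MoE's ceiling is roughly 23 times higher because far fewer bytes are read per token. Real throughput will be lower than this bound.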
Similar Articles
SwiftLM: Pure-Swift Apple Silicon LLM inference server—no Python, runs big models on low-RAM Macs
SwiftLM is a Swift-native LLM inference server for Apple Silicon that runs large models without Python, using SSD streaming to load MoE weights and enabling 122B models on 64 GB Macs.
@julien_c: and is Apple Silicon the King of Local AI?
Discussion on whether Apple Silicon is the best hardware for running local AI models, referencing a linked article or thread.
@MemoryReboot_: Why Mac Studio is a trap for local AI - Large unified memory looks sexy on paper - Great for chatbots, terrible for 24/…
The article argues that the Mac Studio is a poor choice for 24/7 local AI workflows due to the lack of CUDA support and non-upgradable hardware, despite its large unified memory.
@sitinme: There's a pretty interesting open-source project called Cider, specifically designed to accelerate local AI inference on Macs with Apple Silicon chips. Many people buy a Mac mini or MacBook Pro and want to run models locally, but often encounter issues like insufficient speed and high memory usage. Actually...
Cider is an open-source project designed for Apple Silicon Macs, accelerating local AI inference by fully leveraging the computing power of M-series chips. It is compatible with the MLX ecosystem, supports models like Qwen and Llama, and is easy to install.
@awnihannun: It's very cool that Apple shipped a 20B parameter on-device. You can't put 20B parameters in RAM at any reasonable prec…
Apple shipped a 20B-parameter on-device model that uses an MoE variant selecting experts once per query, so the weights can be kept in NAND and inference remains possible despite RAM constraints.