@0xSero: Locally Part 1 - Apple Silicon Macs give you large pools of memory to run big models, but the token generation speed wi…

X AI KOLs Following 04/22/26, 12:18 PM Tools

Summary

Apple Silicon Macs offer large pools of memory for running big models, but token generation is slower than most users are used to. They perform best with large mixture-of-experts (MoE) models that have few active parameters.

Locally Part 1 - Apple Silicon Macs give you large pools of memory to run big models, but the token generation speed will be lower than most are used to. Macs are best with large MoEs that have low ACTIVE params. Basically when you see a model like Qwen3.5-397B-A17B this
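The trade-off described above can be put into rough numbers. The sketch below is a back-of-envelope estimate, not a benchmark. It assumes the common naming convention in which `397B` is the total parameter count and `A17B` is the number of parameters active per token. It also assumes that decoding is limited by how fast the active weights can be read from memory. The 4-bit quantization and the 800 GB/s bandwidth figure are illustrative assumptions rather than values from the post, and the helper names are hypothetical.

```python
import re


def parse_moe_name(name: str) -> tuple[float, float]:
    """Return (total_params_B, active_params_B) from a name like 'X-397B-A17B'.

    Assumes the '<total>B-A<active>B' naming convention.
    """
    match = re.search(r"-(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B", name)
    if match is None:
        raise ValueError(f"no '<total>B-A<active>B' pattern in {name!r}")
    return float(match.group(1)), float(match.group(2))


def weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB for a parameter count given in billions."""
    return params_b * bits_per_weight / 8


def decode_tps_ceiling(active_params_b: float, bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if every token must read all active weights once.

    Ignores KV-cache reads, compute time, and routing overhead, so real
    throughput will be lower.
    """
    return bandwidth_gb_s / weight_gb(active_params_b, bits_per_weight)


if __name__ == "__main__":
    BITS = 4             # assumed quantization, for illustration only
    BANDWIDTH = 800.0    # assumed memory bandwidth in GB/s, for illustration only

    total_b, active_b = parse_moe_name("Qwen3.5-397B-A17B")
    print(f"weights to hold in memory: {weight_gb(total_b, BITS):.1f} GB")
    print(f"MoE decode ceiling:   {decode_tps_ceiling(active_b, BITS, BANDWIDTH):.1f} tok/s")
    print(f"dense decode ceiling: {decode_tps_ceiling(total_b, BITS, BANDWIDTH):.1f} tok/s")
```

Under these assumptions, the full set of weights still has to fit in memory (about 198.5 GB), which is where the large memory pool matters. Only the active 17B parameters are read per token, so the theoretical decode ceiling is far higher than it would be for a dense model of the same total size.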

Cached at: 04/22/26, 03:00 PM

Similar Articles

SwiftLM: Pure-Swift Apple Silicon LLM inference server—no Python, runs big models on low-RAM Macs

X AI KOLs Timeline

SwiftLM is a Swift-native LLM inference server for Apple Silicon that runs large models without Python, using SSD streaming to load MoE weights and enabling 122B models on 64 GB Macs.

@julien_c: and is Apple Silicon the King of Local AI?

X AI KOLs Following

Discussion on whether Apple Silicon is the best hardware for running local AI models, referencing a linked article or thread.

@MemoryReboot_: Why Mac Studio is a trap for local AI - Large unified memory looks sexy on paper - Great for chatbots, terrible for 24/…

X AI KOLs Timeline

The article argues that the Mac Studio is a poor choice for 24/7 local AI workflows due to the lack of CUDA support and non-upgradable hardware, despite its large unified memory.

@sitinme: There's a pretty interesting open-source project called Cider, specifically designed to accelerate local AI inference on Macs with Apple Silicon chips. Many people buy a Mac mini or MacBook Pro and want to run models locally, but often encounter issues like insufficient speed and high memory usage. Actually...

X AI KOLs Timeline

Cider is an open-source project designed for Apple Silicon Macs, accelerating local AI inference by fully leveraging the computing power of M-series chips. It is compatible with the MLX ecosystem, supports models like Qwen and Llama, and is easy to install.

@awnihannun: It's very cool that Apple shipped a 20B parameter on-device. You can't put 20B parameters in RAM at any reasonable prec…

X AI KOLs Following

Apple shipped a 20B parameter on-device model using a MoE variant that selects experts once per query to fit in NAND, enabling inference despite RAM constraints.
