@karminski3: Thinking of buying a Mac to run large models? This post is meant to talk you out of it. The estimate is simple: even if you buy a Mac Studio to run the 4-bit quantized Qwen3.6-27B and enable DFlash to use Qwen's built-in speculative decoding, it only reaches 65 tokens/s. And now most large models can run at 40 tokens/s…
Summary
The author calculates the token cost and break-even period of running large models on a Mac Studio, concluding that it is not cost-effective for ordinary users to buy a Mac for personal large model use, and suggests that using APIs or renting GPUs is more economical.
Cached at: 06/22/26, 01:45 PM
Want to buy a Mac to run large models? This post is meant to talk you out of it.
The calculation is quite simple. Even if you buy a Mac Studio today to run the 4-bit quantized version of Qwen3.6-27B and enable DFlash with Qwen's built-in speculative decoding, you'll only get about 65 tokens/s. Meanwhile, most large models can already reach 40 tokens/s.
Suppose you buy a Mac Studio M3 Ultra (96GB) specifically to run large models, and convert the device price (¥32,999) into API usage. Taking GLM-5.2 as an example, at ¥28 per million tokens, the price of one Mac Studio buys roughly 32,999 / 28 ≈ 1,178 million tokens (about 1.18 billion).
To output that many tokens at 65 tokens/s, a Mac Studio running Qwen3.6-27B would need about 18.1 million seconds, or 209 days of continuous operation. That means your break-even period is at least 200 days of non-stop generation. Only after that does running models locally become pure profit.
This doesn’t even account for electricity costs, or buying API packages instead of pay-as-you-go. And most importantly, this is only for a small 27B model.
If you were to buy a 512GB Mac Studio (¥108,749, and it seems to be out of stock anyway) and run a quantized version of GLM-5.2, the speed drops to only 17 tokens/s, and the break-even period becomes about 7 years…
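The break-even arithmetic above can be reproduced with a short sketch. It uses only the figures quoted in the post (device prices, ¥28 per million tokens, 65 and 17 tokens/s). The function name and its parameters are hypothetical, and the sketch makes the same simplifications as the post: non-stop generation, no electricity cost, and a flat pay-as-you-go API price.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def break_even_days(device_price_cny, api_price_per_million_cny, tokens_per_second):
    """Days of non-stop generation before the device has produced as many
    tokens as its purchase price would have bought from the API.

    Hypothetical helper: ignores electricity, idle time, and API discounts.
    """
    # Tokens the purchase price would buy at the pay-as-you-go API rate.
    tokens_bought = device_price_cny / api_price_per_million_cny * 1_000_000
    # Time the local machine needs to generate that many tokens.
    seconds_needed = tokens_bought / tokens_per_second
    return seconds_needed / SECONDS_PER_DAY


# 96GB Mac Studio running Qwen3.6-27B at 65 tokens/s (figures from the post).
days_27b = break_even_days(32_999, 28, 65)

# 512GB Mac Studio running quantized GLM-5.2 at 17 tokens/s (figures from the post).
days_glm = break_even_days(108_749, 28, 17)

print(f"96GB, Qwen3.6-27B: {days_27b:.0f} days")
print(f"512GB, GLM-5.2:    {days_glm / 365:.1f} years")
```

Under these assumptions the first case comes out just under 210 days and the second a little over 7 years, in line with the post's figures. Any real utilization below 24 hours a day stretches both numbers proportionally.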
Given that new model versions are released roughly every 1.5 months, buying hardware for personal use is not cost-effective. For most users, buying a coding plan is much more worthwhile. If, like me, you need to test new models, renting GPUs directly is also far better than buying.
Of course, if you already own a Mac or a GPU, then running large model tasks during idle time (e.g., while sleeping) can actually be cost-effective.
#local-llm #mac #qwen36 #glm52
Similar Articles
@ai_xiaomu: Here comes a full-featured multimodal local model that runs on a MacBook with 16GB: 1. Download LM Studio; 2. Search for Gemma 4 12B and install it; 3. Ask Codex to configure the local API parameters for you; 4. Then enjoy the freedom of tokens.
Guides users on running the Gemma 4 12B multimodal local model on a MacBook with 16GB RAM using LM Studio and Codex, enabling free token usage.
@nicekate8888: For the past twenty days, I've been obsessing over one thing — how to make Qwen3.6-27B run fast and well on my Mac. I started with Unsloth Q5, got 18 tok/s, and the fan was roaring. Then I switched to MLX 6bit + DFlash, hitting 22 tok/s, still not fast enough. Eventually I found MTPLX 4bit: 43 tok/s with good quality.
The user shares their experience optimizing Qwen3.6-27B inference speed on a Mac using different quantization methods (Unsloth Q5, MLX 6bit + DFlash, MTPLX 4bit), ultimately reaching 43 tok/s.
@sanbuphy: K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac, using the niche Zig language to implement and optimize inference, demonstrating the new model’s generalization ability. After 4,000+ tool calls and 12+ hours of continuous operation, K2.6 iterated 14 times…
K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac, using the niche Zig language to implement and optimize inference, demonstrating the new model’s generalization ability. After 4,000+ tool calls and 12+ hours of continuous operation, K2.6 iterated 14 times, boosting throughput from ~15 tokens/s to ~193 tokens/s, ultimately achieving 20% faster inference than LM Studio.
@berryxia: Damn, this directly steals Apple's thunder! A 6.6B small model shuts up Siri and a bunch of cloud giants, running locally on Mac with just 7GB of RAM. CJ Zafir's Mac-1 not only has ridiculously small parameters but also integrates 487 Mac-native tools, enabling chain calls, automatic reasoning, and more...
CJ Zafir's team has introduced Mac-1, a 6.6B-parameter small model that runs locally on Mac with only 7GB of RAM. It can chain-call 487 Mac-native tools, with an inference speed of 65 tok/s, aiming to disrupt the cloud-based large model-dominated Agent paradigm.
@jun_song: Best mid-range local LLM hardware: DGX Spark vs Mac Studio M5 Max 128GB (upcoming) Price: $4.7k (cheaper if used or OE…
A comparison of DGX Spark vs Mac Studio M5 Max for running local LLMs, highlighting decode speed, prefill performance, RAM, power consumption, and cost. The Mac wins on decode bandwidth but DGX is faster for prefill and supports batching.