@karminski3: Thinking of buying a Mac to run large models? This post is meant to talk you out of it. The estimate is simple. Even if you buy a Mac Studio to run the 4-bit quantized Qwen3.6-27B and enable DFlash to use Qwen's built-in speculative decoding, it only reaches 65 tokens/s. And most large models can now run at 40 tokens/s…

X AI KOLs Timeline 06/22/26, 08:28 AM News

mac-studio local-llm cost-analysis qwen glm self-hosting api-vs-local

Summary

The author calculates the token cost and break-even period of running large models on a Mac Studio, concluding that it is not cost-effective for ordinary users to buy a Mac for personal large model use, and suggests that using APIs or renting GPUs is more economical.

Thinking of buying a Mac to run large models? This post is meant to talk you out of it.

The estimate is simple. Even if you buy a Mac Studio to run the 4-bit quantized Qwen3.6-27B and enable DFlash to use Qwen's built-in speculative decoding, it only reaches 65 tokens/s. And most large models can now run at 40 tokens/s.

Suppose you buy a Mac Studio M3 Ultra 96GB specifically to run large models and convert the device price (32,999 RMB) into API costs. Taking GLM-5.2 as an example, at 28 RMB per million tokens, the price of one Mac Studio buys about 32,999 / 28 ≈ 1,178 million tokens. To output that many tokens at 65 tokens/s, the Mac Studio running Qwen3.6-27B would need to run continuously for 209 days. So the break-even period is at least 200 days of uninterrupted operation, and only after that is running models pure profit.

This doesn't even include electricity costs, or the fact that you could buy API packages instead of paying as you go. Most importantly, this is only for a small 27B model. If you actually buy a 512GB Mac Studio (108,749 RMB, and it seems to be out of stock) and run a quantized version of GLM-5.2, the speed drops to only 17 tokens/s, and the break-even period is about 7 years...

Given that new model versions are released every 1.5 months, it's definitely not cost-effective for ordinary users to buy one for personal use. For most users, buying a coding plan is more cost-effective. If, like me, you need to test new models, renting GPUs directly is much more economical than buying. Of course, if you already own a Mac or GPU, letting it run large model tasks during idle time (e.g., while you sleep) is actually cost-effective.

#LocalLargeModels #mac #qwen36 #glm52


Cached at: 06/22/26, 01:45 PM

Thinking of buying a Mac to run large models? This post is meant to talk you out of it.

The calculation is quite simple. Even if you buy a Mac Studio today to run the 4-bit quantized version of Qwen3.6-27B and enable DFlash to use Qwen’s built-in speculative decoding, you’ll only get about 65 tokens/s. Yet most large models can already reach 40 tokens/s.

Suppose you buy a Mac Studio M3 Ultra (96GB) specifically to run large models and convert the device price (¥32,999) into API usage. Taking GLM-5.2 as an example, at ¥28 per million tokens, the price of one Mac Studio buys roughly 32,999 / 28 ≈ 1,178 million tokens.

To output that many tokens at 65 tokens/s, a Mac Studio running Qwen3.6-27B would need to run continuously for about 209 days. That means your break-even period is at least 200 days of non-stop operation. Only after that does running models become pure profit.
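The arithmetic behind the 209-day figure can be reproduced with a short sketch. The function and variable names below are illustrative, not from the original post, and the calculation assumes the machine generates tokens around the clock at a constant 65 tokens/s, with electricity ignored.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def break_even_days(device_price_rmb, api_price_rmb_per_million, tokens_per_second):
    """Days of continuous generation needed to produce as many tokens
    as the device price would buy at the given API rate."""
    # Tokens the device price would buy through the API.
    tokens_bought = device_price_rmb / api_price_rmb_per_million * 1_000_000
    # Time the local machine needs to generate the same number of tokens.
    seconds_needed = tokens_bought / tokens_per_second
    return seconds_needed / SECONDS_PER_DAY


# Figures quoted in the post: ¥32,999 device, ¥28 per million tokens, 65 tokens/s.
days = break_even_days(32_999, 28, 65)
print(f"Break-even after about {days:.1f} days of non-stop generation")
```

The result lands just under 210 days, in line with the post's figure of 209 days.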

This doesn’t even account for electricity costs, or for the option of buying API packages instead of paying as you go. And most importantly, this is only for a small 27B model.

If you were to buy a 512GB Mac Studio (¥108,749, and it seems to be out of stock anyway) and run a quantized version of GLM-5.2, the speed would drop to only 17 tokens/s, and the break-even period would become about 7 years…
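The 7-year estimate follows from dividing the tokens the device price would buy by the local generation speed. The sketch below checks it and adds a `utilization` parameter, which is a hypothetical knob not present in the original post, to show how the payback period stretches when the machine is not generating tokens around the clock. All names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def break_even_years(device_price_rmb, api_price_rmb_per_million,
                     tokens_per_second, utilization=1.0):
    """Years needed to generate the tokens the device price would buy via API.

    utilization is the fraction of wall-clock time spent generating
    (1.0 means non-stop). It is an assumption added for illustration.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_bought = device_price_rmb / api_price_rmb_per_million * 1_000_000
    effective_speed = tokens_per_second * utilization
    return tokens_bought / effective_speed / SECONDS_PER_YEAR


# Figures quoted in the post: ¥108,749 device, ¥28 per million tokens, 17 tokens/s.
non_stop = break_even_years(108_749, 28, 17)
half_time = break_even_years(108_749, 28, 17, utilization=0.5)
print(f"Non-stop: {non_stop:.1f} years; at 50% utilization: {half_time:.1f} years")
```

Under non-stop operation the result is a little over 7 years, matching the post. Any lower utilization lengthens the payback period proportionally.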

Given that new model versions are released every 1.5 months, buying for personal use is absolutely not cost-effective. So for most users, buying a coding plan is much more worthwhile. If, like me, you need to test new models, renting GPUs directly is also far better than buying.

Of course, if you already own a Mac or a GPU, then running large model tasks during idle time (e.g., while sleeping) can actually be cost-effective.

#local-llm #mac #qwen36 #glm52


Similar Articles

@ai_xiaomu: Here comes a full-featured multimodal local model that runs on a MacBook with 16GB: 1. Download LM Studio; 2. Search for Gemma 4 12B and install it; 3. Ask Codex to configure the local API parameters for you; 4. Then enjoy the freedom of tokens.
Guides users on running the Gemma 4 12B multimodal local model on a MacBook with 16GB RAM using LM Studio and Codex, enabling free token usage.

@nicekate8888: For the past twenty days, I've been obsessing over one thing — how to make Qwen3.6-27B run fast and well on my Mac. I started with Unsloth Q5, got 18 tok/s, and the fan was roaring. Then I switched to MLX 6bit + DFlash, hitting 22 tok/s, still not fast enough. Eventually I found MTPLX 4bit: 43 tok/s with good quality.

@sanbuphy: K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac, using the niche Zig language to implement and optimize inference, demonstrating the new model’s generalization ability. After 4,000+ tool calls and 12+ hours of continuous operation, K2.6 iterated 14 times…

@berryxia: Damn, this directly steals Apple's thunder! A 6.6B small model shuts up Siri and a bunch of cloud giants, running locally on Mac with just 7GB of RAM. CJ Zafir's Mac-1 not only has ridiculously small parameters but also integrates 487 Mac-native tools, enabling chain calls, automatic reasoning, and more...

@jun_song: Best mid-range local LLM hardware: DGX Spark vs Mac Studio M5 Max 128GB (upcoming) Price: $4.7k (cheaper if used or OE…