@karminski3: Thinking of buying a Mac to run large models? This is a deterrent post. Actually, the estimation method is simple. Even if you buy a Mac Studio to run the 4-bit quantized version of Qwen3.6-27B, then enable DFlash to use Qwen's built-in speculative decoding, it only reaches 65 tokens/s. And now most large models can run at 40 tokens/s…

X AI KOLs Timeline News

Summary

The author calculates the token cost and break-even period of running large models on a Mac Studio, concluding that it is not cost-effective for ordinary users to buy a Mac for personal large model use, and suggests that using APIs or renting GPUs is more economical.


Cached at: 06/22/26, 01:45 PM

Want to buy a Mac to run large models? This post is meant to talk you out of it.

Actually, the calculation is quite simple. Even if you buy a Mac Studio today, run the 4-bit quantized version of Qwen3.6-27B, and enable DFlash to use Qwen’s built-in speculative decoding, you’ll only get about 65 tokens/s. But nowadays, most large models can already reach 40 tokens/s.

Suppose you buy a Mac Studio M3 Ultra (96GB) specifically to run large models, and convert the device price (¥32,999) into API usage. Taking GLM-5.2 as an example, at ¥28 per million tokens, the price of one Mac Studio buys roughly 32,999 / 28 ≈ 1,178 million tokens (about 1.18 billion).

To output that many tokens at 65 tokens/s, a Mac Studio running Qwen3.6-27B would need to run continuously for about 209 days. That means your break-even period is at least 200 days of non-stop operation. Only after that does running models locally become pure profit.

This doesn’t even account for electricity costs, or for the option of buying API packages instead of paying as you go. And most importantly, this is only for a small 27B model.

If you were to buy a 512GB Mac Studio (¥108,749, and it seems to be out of stock anyway) and run a quantized version of GLM-5.2, the speed drops to only 17 tokens/s, and the break-even period becomes about 7 years…
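The break-even arithmetic above can be reproduced with a short Python sketch. The function name `break_even_days` and the optional `electricity_per_mtok` parameter are illustrative assumptions, not something from the post; the prices and speeds are the figures quoted above, and the 512GB case assumes the same ¥28 per million tokens API price.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def break_even_days(device_price, api_price_per_mtok, tokens_per_second,
                    electricity_per_mtok=0.0):
    """Days of non-stop generation before local output is worth the device price.

    device_price and api_price_per_mtok share one currency (RMB here).
    electricity_per_mtok is a hypothetical running cost per million tokens;
    the post leaves electricity out, so it defaults to 0.
    """
    saving_per_mtok = api_price_per_mtok - electricity_per_mtok
    if saving_per_mtok <= 0 or tokens_per_second <= 0:
        raise ValueError("local generation never breaks even with these inputs")
    mtok_needed = device_price / saving_per_mtok           # millions of tokens
    seconds = mtok_needed * 1_000_000 / tokens_per_second  # generation time
    return seconds / SECONDS_PER_DAY


# Figures quoted in the post.
days_96gb = break_even_days(32_999, 28, 65)    # M3 Ultra 96GB, Qwen3.6-27B
days_512gb = break_even_days(108_749, 28, 17)  # 512GB, quantized GLM-5.2

print(f"96GB:  {days_96gb:.0f} days")
print(f"512GB: {days_512gb:.0f} days ({days_512gb / 365:.1f} years)")
```

Both results land close to the figures in the post (a little over 200 days, and a little over 7 years). Passing a non-zero `electricity_per_mtok` shrinks the saving per token and so lengthens the break-even period.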

Given that new model versions are released roughly every 1.5 months, buying hardware for personal use is simply not cost-effective. For most users, a coding plan is much better value. If, like me, you need to test new models, renting GPUs directly is also far better than buying.

Of course, if you already own a Mac or a GPU, then running large model tasks during idle time (e.g., while sleeping) can actually be cost-effective.

#local-llm #mac #qwen36 #glm52
