Tag
An uncensored version of the 754B-parameter GLM 5.2 model (231GB GGUF) was successfully deployed on a Mac Studio M3 Ultra with 512GB RAM, achieving approximately 3.6 tokens/s.
Antirez says a branch implementing GLM 5.2 will very likely be merged into DwarfStar; the model could become the best option for the 512GB Mac Studio and might run distributed across 128GB MacBooks with 2-bit quantization.
GLM 5.2 delivers major performance gains on Mac Studio with 512GB RAM, achieving prefill speeds above 100 t/s at high context lengths and enabling 4-bit quantization for contexts over 100k tokens, as detailed in a pull request by the oMLX creator.
The author calculates the per-token cost and break-even period of running large models on a Mac Studio and concludes that buying a Mac to run large models for personal use is not cost-effective for ordinary users; using APIs or renting GPUs is more economical.
The author states they will immediately order an M5 Ultra Mac Studio with max RAM if Apple releases it soon, citing the M3 Ultra's high resale value and the M5's inference performance leap as reasons.
A user reports running GLM 5.2 locally on a Mac Studio with 2-bit quantization, claiming it outperforms Opus 4.8 and enables free, private superintelligence for coding and agent tasks.
GLM 5.2, an open-weight AI model comparable to top closed models, has been released and is now running on MLX on two Mac Studios (M3 Ultra).
The author compares various GPUs for LLM inference, critiquing common benchmarks and emphasizing the importance of prefill performance over generation speed, offering recommendations for different budgets and use cases.
AWS has secured a large number of Apple's M3 Ultra Mac Studio units for cloud services, while regular consumers face continued shortages and limited availability.
A 10-year-old blogger shares his understanding of the AI era: he believes tokens are hard currency and runs multiple AI agents that work together.
Antirez reports that DeepSeek v4 PRO runs well on a Mac Studio M3 Ultra with 512GB RAM using 2-bit quantization, achieving 130 t/s prefill and 13 t/s generation.
A 10-year-old in China uses a Mac Studio to run multiple AI agents, highlighting the emergence of AI-native children who understand tokens and automation.
A comparison of the DGX Spark and the Mac Studio M5 Max for running local LLMs, covering decode speed, prefill performance, RAM, power consumption, and cost. The Mac wins on decode bandwidth, but the DGX Spark is faster at prefill and supports batching.
DS4 is a specialized inference engine by antirez designed to run DeepSeek V4 Flash locally on high-end Mac hardware, featuring optimized KV cache handling and 1M context support.
Apple has removed the 256GB M3 Ultra Mac Studio configuration from its online store, raising speculation about future storage options for upcoming models.
The article argues that the Mac Studio is a poor choice for 24/7 local AI workflows due to the lack of CUDA support and non-upgradable hardware, despite its large unified memory.
The author shares a synthesized buying guide for hardware suitable for running local LLMs, comparing Mac Studio, NVIDIA, and AMD options based on community feedback.
A user shared their personal local LLM stack running on a Mac Studio M2 Ultra with 64GB RAM, combining SuperQwen3.6-35b-mlx-4bit, Ernie Image Turbo, and multiple helper models for coding and chat.
Bloomberg reports that new Mac Studio models won't arrive until at least October 2026, raising questions about when Apple hardware will be capable of running models like DeepSeek v4.