@garrytan: Downloading now... 1M token context window with supposedly usable coding agent capability all on a 128GB Macbook Pro is
Summary
Garry Tan highlights a model with a 1M token context window and coding agent capabilities running locally on a 128GB MacBook Pro, expressing excitement about the milestone.
Full Text
Downloading now… 1M token context window with supposedly usable coding agent capability all on a 128GB Macbook Pro is 🤯 https://t.co/otTL8NZMvV
Similar Articles
@0xSero: Locally Part 1 - Apple Silicon Macs give you large pools of memory to run big models, but the token generation speed wi…
Apple Silicon Macs offer large unified-memory pools for running big models, but token generation is slower than on discrete GPUs; they perform best with large MoE models that have few active parameters.
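The MoE point is the practical takeaway: decode speed tracks the active parameter count rather than the total, so a huge MoE can still be usable once it fits in unified memory. As a minimal sketch, assuming llama-cpp-python and a hypothetical GGUF path, loading such a model with full Metal offload looks like this:

```python
from llama_cpp import Llama

# On Apple Silicon, n_gpu_layers=-1 offloads every layer to Metal,
# so the unified memory pool holds the full set of expert weights
# while each token only activates a small fraction of them.
llm = Llama(
    model_path="models/big-moe-q4_k_m.gguf",  # hypothetical path
    n_ctx=32768,      # context window to allocate KV cache for
    n_gpu_layers=-1,  # offload all layers to the GPU (Metal)
)

out = llm("Write a binary search in Python.", max_tokens=256)
print(out["choices"][0]["text"])
```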
@rohanpaul_ai: atomic[.]chat just made Gemma 4 26B faster inside LLaMA.cpp. making token generation about 40% faster in its MacBook Pr…
atomic.chat has optimized Gemma 4 26B inference in llama.cpp, achieving roughly 40% faster token generation on a MacBook Pro M5 Max using Multi-Token Prediction (MTP) speculative decoding. This is a notable win for local AI users running desktop apps, coding agents, and private on-device assistants.
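For context on why MTP helps: speculative decoding has a cheap draft propose several tokens that the full model then verifies together, so one expensive forward pass can commit multiple tokens. The sketch below shows the accept/verify loop under greedy decoding; it illustrates the general technique, not atomic.chat's implementation, and `target_next`/`draft_next` are stand-in callables:

```python
def speculative_step(target_next, draft_next, ctx, k=4):
    """One speculative-decoding step under greedy decoding.

    target_next / draft_next map a token list to that model's next
    greedy token. Returns the tokens accepted this step (at least one).
    """
    # 1. The cheap draft proposes k tokens autoregressively.
    proposal, draft_ctx = [], list(ctx)
    for _ in range(k):
        tok = draft_next(draft_ctx)
        proposal.append(tok)
        draft_ctx.append(tok)

    # 2. The target verifies the proposal. In a real engine this is a
    #    single batched forward pass over all k positions (the source
    #    of the speedup); here it is simulated position by position.
    #    Accept until the first mismatch, then keep the target's token.
    accepted = []
    for tok in proposal:
        expected = target_next(ctx + accepted)
        if tok != expected:
            accepted.append(expected)
            break
        accepted.append(tok)
    return accepted
```

MTP differs in that the draft heads live inside the main model, making drafting nearly free; the verification logic is the same.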
Is anyone getting real coding work done with Qwen3.6-35B-A3B-UD-Q4_K_M on a 32GB Mac in opencode, claude code or similar?
A user shares their experience running the quantized Qwen3.6-35B-A3B model on an M2 MacBook Pro with 32GB RAM for coding tasks via opencode and llama.cpp. The 32K context window their setup can afford forces compaction that drops critical context, making complex coding tasks impractical; they conclude that meaningful agentic coding with this model needs at least a 128K context, more than their hardware can support.
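The ceiling the poster hits is mostly KV-cache memory, which grows linearly with context length. A back-of-envelope calculator, assuming an fp16 cache and architecture figures in the range of current A3B-class MoEs (48 layers, 4 KV heads of dimension 128; these are assumptions, not the poster's measurements):

```python
def kv_cache_gib(n_tokens, n_layers=48, n_kv_heads=4, head_dim=128,
                 bytes_per_elem=2):
    """KV-cache size in GiB: two tensors (K and V) per layer, each
    holding n_kv_heads * head_dim elements per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return n_tokens * per_token / 2**30

for ctx in (32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx):.1f} GiB of KV cache")
# ~3 GiB at 32K vs ~12 GiB at 128K, on top of the quantized weights,
# the OS, and the app, all sharing the same 32GB of unified memory.
```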
2x 512GB RAM M3 Ultra Mac Studios
A user shares their $25k hardware setup of two 512GB RAM M3 Ultra Mac Studios for running large language models locally, having tested DeepSeek V3 Q8 and GLM 5.1 Q4 via the exo distributed inference backend, while awaiting Kimi 2.6 MLX optimization.
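The pooled-memory arithmetic for the DeepSeek V3 run is easy to check: Q8 is roughly one byte per parameter, so the published 671B-parameter model needs on the order of 700GB plus runtime overhead. A rough sizing sketch, where the overhead factor is an assumption:

```python
params_b = 671         # DeepSeek V3's published parameter count, billions
bytes_per_param = 1.0  # Q8 quantization: roughly one byte per parameter
overhead = 1.1         # assumed factor for KV cache, buffers, runtime

need_gb = params_b * bytes_per_param * overhead
have_gb = 2 * 512      # two pooled 512GB Mac Studios
print(f"need ~{need_gb:.0f} GB of {have_gb} GB pooled: "
      f"{'fits' if need_gb < have_gb else 'does not fit'}")
```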
@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…
A quantized 478B-parameter GLM-5.1 model runs on 4×RTX Pro 6000 GPUs via SGLang with a 370k-token serving budget (1.75× the model's full context), decoding at 27.7–45 tok/s (p10–p90) and prefilling at up to 1340 tok/s; it is demoed driving Figma.
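A similar sanity check makes the 370k-token budget plausible: NVFP4 weights take about half a byte per parameter, leaving a large slice of VRAM for KV cache. Card capacity (96GB per RTX Pro 6000) and the clean 4-bit estimate are assumptions, and the result is an upper bound since the runtime also needs memory:

```python
params_b = 478              # GLM-5.1 parameter count from the post
weight_gb = params_b * 0.5  # NVFP4: ~4 bits per parameter
vram_gb = 4 * 96            # assumed 96GB per RTX Pro 6000

free_gb = vram_gb - weight_gb  # headroom for KV cache and runtime
max_tokens = 370_000           # serving budget quoted in the post
kib_per_token = free_gb * 2**20 / max_tokens  # GB/GiB blurred, rough
print(f"~{free_gb:.0f} GB free -> at most {kib_per_token:.0f} KiB "
      f"of cache per token at the 370k-token budget")
```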