@Xudong07452910: A hot comment section on Hacker News: Qwen 3.6 27B is the ideal choice for local development. Key findings: dense parameter model, native support for 256k context, running Q8_0 quantized version at 30 tokens/…

X AI KOLs Timeline Models

Summary

Qwen 3.6 27B is a dense 27B model that achieves impressive performance on local hardware with 256k context, running at 30 tokens/s on MacBook Max M5 and 50 tokens/s on RTX 5090, and is considered by some as the first local model with true general intelligence.

A hot comment section on Hacker News: Qwen 3.6 27B is the ideal choice for local development. Key findings: dense parameter model, native support for 256k context, running Q8_0 quantized version at 30 tokens/s on MacBook Max M5, and up to 50 tokens/s on RTX 5090. The author calls it 'the first local model with true general intelligence'. Previously, local models always had bottlenecks — this one handles creative writing and generating a hexagonal Minesweeper game from a single prompt, both quite well. Behind this assessment lies a larger trend: the engineering accumulation from years of refining large cloud models is beginning to scale down to local hardware. Privacy-first, controllable latency, offline availability — these three previously conflicting requirements can now be met simultaneously. The line of 'good enough local models' is approaching. If a 30B-level model can truly achieve general-purpose capability, then daily development and lightweight agent tasks without relying on API calls may be closer than most people think. https://quesma.com/blog/qwen-36-is-awesome/…
Original Article
View Cached Full Text

Cached at: 07/03/26, 08:32 AM

There was a post on Hacker News that blew up in the comments: Qwen 3.6 27B is the ideal choice for local development.

The core finding: a dense parameter model with native 256k context, running the Q8_0 quantized version at 30 tokens/s on a MacBook Max M5, and 50 tokens/s on an RTX 5090.

The author calls it “the first local model that actually makes sense as a general intelligence.” Previously, local models always had some weak link; this time, with a single prompt, it successfully handled both creative writing and generating a hexagonal minesweeper game — both tasks were done well.

Behind this assessment lies a larger trend: the engineering accumulation that cloud-based large models have refined over many years is now starting to scale down to local environments. Privacy control, low latency, and offline availability — these three needs that once required trade-offs — can now be satisfied simultaneously.

The “good enough local model” threshold is moving closer to us. If a 30B-class model can truly reach general‑purpose competence, relying on API calls for everyday development and lightweight agent tasks might be closer than most people think. https://quesma.com/blog/qwen-36-is-awesome/…


Qwen 3.6 27B is the sweet spot for local development

Source: https://quesma.com/blog/qwen-36-is-awesome/ I’ve been disappointed by local models in the past. But then I checked Qwen 3.6, and I was in awe. For me it’s the first local model that actually makes sense as a general intelligence.

It comes in two variants, a mixture-of-experts modelQwen 3.6 35B A3B (https://huggingface.co/Qwen/Qwen3.6-35B-A3B), and a denseQwen 3.6 27B (https://huggingface.co/Qwen/Qwen3.6-27B)- slower, but more powerful. The one I recommend!

Let me share my impressions, and show that you can run it too.

Thermal camera image

It’s hot, literally. When my knees started to melt, I grabbed a phone-attachedthermal camera (https://github.com/stared/thermal-upscale)and took a photo.

Qwen 3.6, rightfully,got a lot of coverage on Hacker News (https://hn.algolia.com/?dateEnd=1782305498&dateRange=custom&dateStart=1775001600&page=0&prefix=true&query=qwen&sort=byPopularity&type=story). The most common statement about Qwen 3.6 27B is that it punches above its weight - seeWill it Mythos? (https://swelljoe.com/post/will-it-mythos/). And I think it is a well-deserved sentiment. It will make your computer hot, but it’s worth it!

Testing the waters

Simon Willison uses “penguins on a bicycle” as a smoke test (see forQwen 3.6 35B A3B (https://simonwillison.net/2026/Apr/16/qwen-beats-opus/)and thenQwen 3.6 27B (https://simonwillison.net/2026/Apr/22/qwen36-27b/)). I usually go with constrained writing.

Chat about quantum mechanics with Qwen 3.6

A year ago these kinds of things were state of the art, needing a unique, and insanely expensive GPT-4.5, seevibe translating Quantum Flytrap (https://p.migdal.pl/blog/2025/04/vibe-translating-quantum-flytrap/).

I also asked it to write an 8 line poem about Zouk dance and quantum physics, seethe transcript (https://gist.github.com/stared/bac79cd053ea5443abcf58e622c083b7). The thought process made sense, both in terms of deliberation on quantum terms, and rhymes.

Then I asked in OpenCode to create a hexagonal minesweeper usingpnpm. It worked:

Hexagonal minesweeper in with Qwen 3.6 27B in OpenCode

It worked on the first go, from a single prompt, with a proper Node package. The mixture-of-experts Qwen 3.6 35B A3B was faster… but ignored my instruction to create a package, and did it in a singleindex\.html.

Real work

Sure, creative writing about quantum mechanics, or yet another clone of a minesweeper, is rarely a day job. But Qwen 3.6 27B is decent at regular tasks as well.

Maciej Cielecki’s candle-shop prompt running in OpenCode

Prompt by a friend,Maciej Cielecki (https://cielecki.com/), atAI Tinkerers Warsaw (https://poland.aitinkerers.org/).

It worked for a few minutes and created this:

A landing page by Qwen 3.6 27B

A landing page by Qwen 3.6 27B —view the live page (https://quesma.com/blog/qwen-36-is-awesome/qwen-landing-demo/).

By standards of current frontier models, it’s unremarkable. But it is already a practical job. It worked, was reactive, defaults were nice - all from a single, short prompt.

Running Qwen 3.6 locally with llama.cpp

Running local models is easier than ever. A few CLI lines and you’re off.

I recommendllama.cpp (https://github.com/ggml-org/llama.cpp)- a direct, open source tool that allows running models on various devices. You don’t need Ollama, and frankly -I would recommend against using that on ethical grounds (https://sleepingrobots.com/dreams/stop-using-ollama/).

First, we go to Hugging Face, to get proper quantization, i.e. a model with reduced size - popular ones are byunsloth (https://huggingface.co/unsloth/Qwen3.6-27B-GGUF)orbartowski (https://huggingface.co/bartowski/Qwen_Qwen3.6-27B-GGUF), among others. Default models usually come withBF16precision. A common 8-bit quantization saves half the space at almost no cost to quality. Going further down the road, models are smaller (and potentially - faster), but at the cost of quality, seethis comparison for 27B (https://www.reddit.com/r/LocalLLaMA/comments/1tr9vzn/qwen3627b_quantization_benchmark/)and another one for35B A3B (https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/discussions/10).

We grabunsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 (https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF), an 8-bit quantization with support for multi-token prediction (MTP).

llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 \ --spec-type draft-mtp -ngl 999 -fa on -c 65536 --port 8080

What it does:

  • \-hf unsloth/Qwen3\.6\-27B\-MTP\-GGUF:Q8\_0grabs from Hugging Face, on the next runs will reuse that
  • \-m ~/models/Qwen3\.6\-27B\-Q8\_0\.ggufuse instead if you already have it
  • draft\-mtpwe use a fast model to predict subsequent tokens, speeds up things
  • \-ngl 999for putting all layers to GPU
  • \-fa onflash attention is on
  • \-c 65536context size set to 64k tokens (this we can tweak, as Qwen 3.6 27B native context is 256k)
  • \-\-port 8080better to pin port, as it will be used by other configs

If you openhttp://127\.0\.0\.1:8080, you can directly chat with it.

Precisely the same server can be used for vibe coding. Choice of agent depends both on one’s goal and subjective taste - for an all-around OpenCode, minimalistic Pi, and self-improving Hermes.

For OpenCode, it is as simple as adding to~/\.config/opencode/opencode\.jsonc:

{ "$schema": "https://opencode.ai/config.json", "provider": { "llama": { "name": "llama.cpp (local)", "npm": "@ai-sdk/openai-compatible", "options": { "baseURL": "http://127.0.0.1:8080/v1", "apiKey": "local" }, "models": { "qwen3.6-27b": { "name": "Qwen3.6-27B Q8 +MTP" } } } }, "model": "llama/qwen3.6-27b" }

If you just want to chat and are a big fan of Terminal, instead ofllama\-serverusellama\-cli:

llama-cli -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 \ -ngl 999 -fa on -c 65536

Measuring performance

Is it fast enough?

I ran a few tests (source is here (https://github.com/stared/benching-local-llms-on-apple-silicon)) on my Macbook Max M5 128 GB, running it with and without multi-token prediction, and comparing both with the 35B A3B model, and also a quantized DeepSeek V4 Flash versionDwarfStar4 (https://github.com/antirez/ds4).

DeepSeek-V4-Flash· Q2–Q4

30 tokens per second is not bad,well within typical frontier model API range (https://openrouter.ai/openai/gpt-5.5#performance). Whilemlx-lm (https://github.com/ml-explore/mlx-lm)is precisely targeted at Apple Silicon devices, and AI agents heavily recommend it, llama.cpp turned out to be faster. It was using 95% of GPU, which means it is efficiently using available resources.

Macbook Max M5 is a beast (at least for a laptop), but on other devices it should also work decently. As you can see, both Qwen 3.6 variants run within 48 GB of Apple Silicon’s shared RAM. A 4-bit quantization are less than 18 GB and should run on 32 GB device. On consumer Nvidia RTX cards, you need to quantize aggressively, but inference runs even faster.

I set this up today on my 5090 at Q6_K quantization and Q4_0 KV, got 50 tokens/s consistently at 123k context, using ~28/32gb vram through LM Studio. -gfosco on the Hacker News (https://news.ycombinator.com/item?id=47871338)

While 35B A3B is 3x faster, I prefer 27B. I’d rather generate a third as much code, but of higher quality.

How do they relate to previous state of the art models?

Manual inspection is great, but benchmarks help with grounding intuitions. Here is the score fromArtificial Analysis (https://artificialanalysis.ai/), comparing it with frontier models:

Gemma 4 31B

≈ late 2024

o1 / Claude 3.5 Sonnet

Qwen3.6-35B-A3B

≈ early 2025

o3 / Claude 4 Sonnet

Qwen3.6-27B

≈ mid 2025

GPT-5 / Claude Sonnet 4.5

DeepSeek-V4-Flash

≈ late 2025

GPT-5.2 / Claude Opus 4.5

A few more benchmarks are inthese notes (https://github.com/stared/benching-local-llms-on-apple-silicon), but the spirit is similar. Added hereGemma 4 31B (https://deepmind.google/models/gemma/gemma-4/), as a lot of people use this as the default for local coding. But both benchmarks and general sentiment online favour Qwen 3.6 27B by a large margin.

Here there is a caveat - 8-bit quantization of Qwen 3.6 likely does not affect results much, but DwarfStar4 uses much more aggressive ones for DeepSeek V4 Flash, 2-4 bit. For sure it is worse than the full model. My personal impression is that within these quantizations Qwen 3.6 27B is as good as (or maybe slightly better than) DwarfStar4. Though, I won’t be surprised if for longer context projects DS4 has an edge.

What’s next

I think we are entering a fascinating era, when it becomes feasible to run one’s own models.

The change will be propelled further by the state of proprietary frontier models. Claude Fable 5 was taken down. Other frontier models run at a massive subsidy, where paying $100 a month gives us thousands worth in tokens. Let’s use the discount while it lasts!

A locally set model can be fine-tuned to our needs, and cannot be taken away. Businesses can use them for proprietary and sensitive data. We can use them personally for offline projects, or when we don’t feel comfortable sharing our deepest secrets, or medical data, with the US or China.

With the release offrontier-level open-weight GLM 5.2 (https://artificialanalysis.ai/articles/glm-5-2-is-the-new-leading-open-weights-model-on-the-artificial-analysis-intelligence-index), there is a new era. While Qwen 3.6 was the stepping stone, even frontierGLM 5.2 can be run locally (https://unsloth.ai/docs/models/glm-5.2). It won’t run on your Macbook or a single RTX 5090. But still, it is manageable with a company budget.

Moreover, I strongly believe that we will have models smarter than current state of the art, while runnable on local devices, maybe even smartphones. Current models combine both raw intelligence and factual knowledge in the same weights. Future models will likely separate that, offloading a lot of knowledge to tool calling.

Discuss onHacker News (https://news.ycombinator.com/item?id=48721903),LinkedIn (https://www.linkedin.com/feed/update/urn:li:activity:7477477713050685440/), orX (https://x.com/pmigdal/status/2071641659289330168).

Similar Articles

Qwen 3.6 27B is the sweet spot for local development

Hacker News Top

Qwen 3.6 27B is praised as a powerful local AI model that outperforms expectations for general intelligence, suitable for practical tasks like code generation, and runs easily with llama.cpp.

@sanbuphy: K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac, using the niche Zig language to implement and optimize inference, demonstrating the new model’s generalization ability. After 4,000+ tool calls and 12+ hours of continuous operation, K2.6 iterated 14 times…

X AI KOLs Timeline

K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac, using the niche Zig language to implement and optimize inference, demonstrating the new model’s generalization ability. After 4,000+ tool calls and 12+ hours of continuous operation, K2.6 iterated 14 times, boosting throughput from ~15 tokens/s to ~193 tokens/s, ultimately achieving 20% faster inference than LM Studio.

Qwen 3.6 27B on DeepSWE

Reddit r/LocalLLaMA

Qwen 3.6 27B scored 2% on the DeepSWE benchmark, placing 18/20 above Haiku 4.5 and Minimax M2.7, highlighting the gap between local and leading-edge models.

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

Reddit r/LocalLLaMA

The author shares a high-performance local inference configuration for running Qwen3.6 35B A3B on limited hardware (8GB VRAM, 32GB RAM) using a modified llama.cpp with TurboQuant support, achieving ~37-51 tok/sec with ~190k context.