@outsource_: NEW GLM+ QWEN 18B RUNS ON CONSUMER GPU IT BEATS 35B MoE AT HALF THE VRAM @KyleHessling1 just dropped the healed Qwopus-…
Summary
A new 18B merged, quantized model, Qwopus-GLM-18B-Merged-GGUF, is reported to beat Qwen 3.6-35B MoE on a 44-test suite while using about half the VRAM and running on consumer GPUs.
Cached at: 04/21/26, 10:32 AM
NEW GLM+ QWEN 18B RUNS ON CONSUMER GPU. IT BEATS 35B MoE AT HALF THE VRAM. @KyleHessling1 just dropped the healed Qwopus-GLM-18B-Merged-GGUF. Insane 64-layer frankenmerge of two elite Qwen3.5-9B finetunes (Opus reasoning + GLM-5.1 distill). This thing is cooking on consumer
Similar Articles
24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)
A developer demonstrates running MoE models such as Qwen 3.6 35B-A3B and Gemma 4 26B-A4B at 24+ tok/s on an old GTX 1080 (8 GB VRAM) with 128k context, using llama.cpp with MoE offloading and TurboQuant KV cache quantization. The post also covers optimization tricks for Gemma's MTP speculative decoding.
KyleHessling1/Qwopus-GLM-18B-Merged-GGUF
An experimental 18B-parameter model created by stacking two Qwen3.5-9B finetunes and healing the layer boundary with 1000 steps of QLoRA fine-tuning; the resulting GGUF beats Qwen 3.6-35B MoE on a 44-test suite while fitting in 9.2 GB VRAM.
Jackrong/Qwopus-GLM-18B-Merged-GGUF
Jackrong released Qwopus-GLM-18B-Merged-GGUF, a 64-layer frankenmerge combining two Qwen3.5-9B finetunes into an ~18B parameter model, healed with 1000-step LoRA fine-tuning to fix layer boundary issues. The model achieves 90.9% on capability benchmarks while using less than half the VRAM of Qwen 3.6-35B MoE.
@sudoingX: anyone running a 16gb card, stop scrolling. @pupposandro and @davideciffa got qwen 35b-a3b down to 13.3gb, measured on …
A technique called luce spark lets the Qwen 35B-A3B MoE model run on a 16GB GPU (like RTX 3090) by learning which experts are used most often and streaming the rest from RAM, achieving ~100 tok/s without a VRAM bottleneck.
@cniongolo: I’m not sure people realize yet that you can actually run Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-MTP-GGUF on a dua…
Demonstrates running a custom Qwen model (Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-MTP-GGUF) on dual Nvidia RTX PRO 6000 Blackwell GPUs at 195 tokens per second using Hugging Face Inference.
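As a rough sanity check on the Qwopus-GLM-18B-Merged-GGUF figures quoted above (64 layers, ~18B parameters, 9.2 GB VRAM), the stacking and the memory arithmetic can be sketched in a few lines. This is a back-of-envelope illustration, not the author's merge script: the function names are hypothetical, the even 32 + 32 layer split is an assumption, and the 9.2 GB figure is treated as if it were all weights, which makes the result an upper bound.

```python
"""Back-of-envelope checks for the stacked-merge figures quoted above."""


def stack_layers(n_layers_a: int, n_layers_b: int) -> list[tuple[str, int]]:
    """Return a layer plan that places all of model A's layers before model B's.

    Each entry is (source_model, source_layer_index). This only describes the
    ordering; it does not load or copy any weights.
    """
    plan = [("A", i) for i in range(n_layers_a)]
    plan += [("B", i) for i in range(n_layers_b)]
    return plan


def bits_per_weight(size_gb: float, params_billion: float) -> float:
    """Average bits stored per parameter, taking 1 GB as 10**9 bytes."""
    return size_gb * 8.0 / params_billion


if __name__ == "__main__":
    # ASSUMPTION: the 64 layers split evenly, 32 from each finetune.
    plan = stack_layers(32, 32)
    print(len(plan), "layers; boundary between", plan[31], "and", plan[32])

    # ASSUMPTION: the quoted 9.2 GB is counted entirely as weights, so this
    # is a rough upper bound on the average bits per weight.
    print(f"{bits_per_weight(9.2, 18.0):.2f} bits per weight (upper bound)")
```

The boundary printed by the first check is the seam between the two stacked finetunes, which is the spot the summaries describe as being healed with fine-tuning. The second check gives roughly 4 bits per weight, in the range of a 4-bit quantization.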