Tag
Google's new Gemma 4 12B is a single decoder-only transformer with encoder-free multimodal input, achieving strong benchmark results while being small enough to run locally on a budget GPU. It is released under the Apache 2.0 license.
The author presents a small transformer-based neural network trained from scratch to turn any image into a playable game, running in real time on consumer GPUs like an RTX 5090. The model uses autoregressive decoding with KV caching but currently has issues with motion and context.
The article discusses how the Qwen3.6-35B-A3B model exhibits different failure modes when used as a sub-agent under an orchestrator compared to solo use, particularly due to its MoE architecture and the lack of validation layers, leading to undetected errors.
The author details attempts to locally train a Qwen 3.6 27B autoregressive-to-diffusion model on an NVIDIA RTX 5090 GPU using QLoRA and modifications from open-dllm and d3LLM, facing VRAM constraints and hardware issues while exploring one-shot diffusion techniques.
Qwen 3.6 27B runs fast on 16 GB of VRAM thanks to 'Pure Quant' technology, achieving 40 tokens/s with MTP and supporting 64k-token contexts, enabling local AI on consumer GPUs like the RTX 4060 Ti.
A hobbyist trained a 235M-parameter LLM from scratch on a single RTX 5080, sharing the full PyTorch pipeline and open-sourcing Plasma 1.0.
A new 18B merged quantized model, Qwopus-GLM-18B-GGUF, outperforms 35B MoE models while using half the VRAM and running on consumer GPUs.