8gb-vram


@analogalok: Gemma 4 12B QAT (dense) achieves 1000+ tokens/sec prefill on 8GB VRAM with 120k context

X AI KOLs Following · 7h ago

Gemma 4 12B QAT (dense) reaches over 1,000 tokens per second of prefill on an 8GB RTX 4060 with a 120k-token context. The run uses TurboQuant, which allows all layers to be offloaded to the GPU. The post reports this as a 42% increase in prefill speed over previous methods.


@analogalok: my 8 GB VRAM gaming laptop is absolutely going to hate me for this. but I still did it. ran a 31b dense model (Gemma 4 …

X AI KOLs Timeline · yesterday

The user runs the Gemma 4 31B dense model on a gaming laptop with 8GB VRAM at ~3 tokens/sec, using llama.cpp with MTP speculative decoding. The post shows that a 31B dense model can run on consumer hardware and proposes agentic workflows in which a fast MoE model routes hard tasks to this slower dense model.
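The routing idea in this summary can be sketched roughly as follows. This is a minimal illustration, not the poster's setup: the `Router` class, the `looks_hard` heuristic, and the stand-in model functions are all hypothetical. In the post the fast MoE model itself decides when to escalate, whereas here a simple keyword and length check stands in for that decision.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: a "model" is any function from prompt text to reply text.
Model = Callable[[str], str]


@dataclass
class Router:
    fast_model: Model               # stand-in for a fast MoE model
    slow_model: Model               # stand-in for the slow dense model
    is_hard: Callable[[str], bool]  # hypothetical difficulty check

    def answer(self, prompt: str) -> tuple[str, str]:
        """Return (which_model, reply) for a prompt."""
        if self.is_hard(prompt):
            return "slow", self.slow_model(prompt)
        return "fast", self.fast_model(prompt)


def looks_hard(prompt: str) -> bool:
    # Placeholder heuristic (an assumption): long prompts or certain
    # keywords are treated as hard and sent to the slow model.
    keywords = ("prove", "refactor", "debug")
    return len(prompt) > 400 or any(k in prompt.lower() for k in keywords)


# Stand-in models that only echo the prompt; a real setup would call
# two local inference endpoints instead.
router = Router(
    fast_model=lambda p: f"[fast] {p}",
    slow_model=lambda p: f"[slow] {p}",
    is_hard=looks_hard,
)
```

Calling `router.answer("What is 2+2?")` goes to the fast stand-in, while a prompt containing "debug" goes to the slow one. The point of the design is that the ~3 tokens/sec cost is only paid for the minority of requests that need it.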


@VincentLogic: An entry-level laptop with 8GB VRAM can now run a fully autonomous AI Agent. Method: Gemma 4 26B + Hermes Desktop. Run the 26B model locally with just 8GB VRAM + 16GB RAM. What can it do after connecting Hermes? …

X AI KOLs Timeline · 2026-06-08

The post describes running a fully autonomous AI agent on an entry-level laptop with 8GB VRAM, using the Gemma 4 26B model and the Hermes Desktop tool. The setup supports local file operations, code modification, and web browsing, among other tasks, and significantly lowers the barrier to running agents locally.


Me train LLM on 8GB from Scratch. Me happy

Reddit r/LocalLLaMA · 2026-05-29

The author built a repository for training a tiny language model (25M parameters) from scratch on 8GB VRAM. It supports MTP, and the author notes limitations of mHC and BitNet.


24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

Reddit r/LocalLLaMA · 2026-05-13

A developer demonstrates running MoE models such as Qwen 3.6 35B-A3B and Gemma 4 26B-A4B at 24+ tok/s on an old GTX 1080 (8GB VRAM) with 128k context. The setup uses llama.cpp with MoE offloading and TurboQuant KV cache quantization, and the post also covers optimization tricks for Gemma's MTP speculative decoding.
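To see why KV cache quantization matters at 128k context on an 8GB card, a back-of-the-envelope estimate helps. The sketch below assumes a plain attention layout with one K and one V tensor per layer. The layer count, KV head count, and head dimension are made-up illustrative values, not the real dimensions of Qwen or Gemma, and the 4-bit width is an assumption rather than a stated property of TurboQuant. It also ignores quantization scale overhead and any sliding-window layers.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bits_per_value: int) -> float:
    """Estimate KV cache size in bytes for a plain attention layout."""
    # Factor of 2: each layer stores one K tensor and one V tensor.
    values = 2 * n_layers * n_kv_heads * head_dim * context_len
    return values * bits_per_value / 8


# Hypothetical model dimensions, chosen only for illustration.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128, context_len=128 * 1024)

fp16_gib = kv_cache_bytes(**cfg, bits_per_value=16) / GIB
q4_gib = kv_cache_bytes(**cfg, bits_per_value=4) / GIB

print(f"16-bit KV cache: {fp16_gib:.1f} GiB")
print(f" 4-bit KV cache: {q4_gib:.1f} GiB")
```

With these illustrative numbers the 16-bit cache alone comes to 16 GiB, twice the card's VRAM, while a 4-bit cache comes to 4 GiB. Real models differ, but the scaling with context length and bit width is the same.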
