Floor for local meeting summarization on a 6GB GPU: qwen3.5:0.8b works at 57s, Granite 4 350M hallucinates

Reddit r/LocalLLaMA 05/19/26, 04:50 PM Tools

open-source meeting-summarization local-llm whisper benchmarking small-models gpu

Summary

The author introduces VoiceFlow, an open-source local dictation and meeting transcription tool, and benchmarks two small LLMs (qwen3.5:0.8b and Granite 4.0 350M) for meeting summarization on a 6GB GPU. The 0.8B Qwen proves viable once num_ctx is raised to 16384, producing a structured summary in 57 seconds, while the 350M Granite is fast but hallucinates. The author also asks the community for long-context summarization approaches on low VRAM.

Disclosure: I made this. Open-source, MIT, Windows + Linux. Not affiliated with [voiceflow.com](http://voiceflow.com) (the chatbot SaaS, name collision, sorry).

Why this exists: I wanted local-only dictation and meeting transcription, because audio shouldn't have to leave the machine just to become text. I had a 6GB GPU sitting there doing nothing most of the day. So I built it: hold a hotkey, faster-whisper transcribes locally, text pastes at the cursor.

v1.6.0 shipped today and adds the meetings recorder: mic + system audio into one stereo file, transcribed locally, summary goes through whatever endpoint you point it at (Ollama, llama.cpp, Groq, OpenAI). The only network call in the whole product is the optional summary, and you pick where it goes.

The on-topic part for this sub: mini models on real workloads. v1.6.0 was the excuse to actually benchmark this on real meeting transcripts instead of toy prompts. I tried the latest small Qwen first, qwen3.5:0.8b (873M, Q8_0).

Test rig: RTX 3060 Laptop 6GB, ~4.3GB free after Whisper loads, Ollama 0.23, Arch. Input: a real 4-minute meeting, ~2900 chars.

It works, with one caveat. Ollama's VRAM-aware default num_ctx on this GPU is 4096, and on a reasoning model with thinking on by default, that budget gets eaten before the user-visible tokens land. One-line Modelfile fix:

    FROM qwen3.5:0.8b
    PARAMETER num_ctx 16384

After that it streamed a 1562-char structured summary in 57 seconds at 2.2GB of VRAM. TL;DR, decisions, action items, open questions, all there. Better than I'd expect from sub-1B, honestly.

For the "but you didn't go small enough" counter: I sanity-checked Granite 4.0 350M on the same workload. Speed-wise it crushed (0.6 to 2.8 seconds per summary vs 57s for the 0.8B Qwen) and structure came back clean, sections all in the right places. Then I read the output. On a transcript about Anthropic acquiring Bun, Granite returned "Anthropic's acquisition by Anthropic" and invented Binance as a discussion topic. A different 4-minute meeting came back as a Star Trek bridge log ("Starship Cassiopeia", "Tao City F", colony vessel Andromeda). Keywords matched, relationships scrambled.

So qwen3.5:0.8b-vf is the working floor for me. I haven't seen anything coherent come out of sub-500M on real conversation data yet, but I'm open to being shown wrong.

For people who don't want to run local: Groq's free tier on llama-3.3-70b has been solid. ~2 seconds per summary, output is tighter than the local 0.8B, and the only thing that broke it for me was a 4-hour meeting transcript that blew past their context window. For anything under that, it's a real free option.

The actual question I'd like answers on, since this is the sub that knows: long-context structured summarization on low VRAM. The 0.8B Qwen handles a 4-minute meeting comfortably at 16K context. For 1-2 hour transcripts (~30K-60K tokens) on a 6-8GB GPU, what's working? Pushing context wider and eating the VRAM, chunked map-reduce, or a different small model that doesn't fall apart on long inputs? Looking for something that holds structure (TL;DR + sections + bullets) when the input gets long, without needing 24GB of VRAM to do it.

App: one .exe on Windows, one .AppImage on Linux. Pyloid + React + faster-whisper + SQLite, CUDA auto-detect with CPU fallback. Model + mic + hotkey done in onboarding in about a minute. Claude was the pair-programming assistant for a lot of boilerplate and the Qt threading gnarliness; architecture and the hard bugs are mine, git history is honest about it.

Repo + 1.6.0:

[https://github.com/infiniV/VoiceFlow](https://github.com/infiniV/VoiceFlow)

[https://github.com/infiniV/VoiceFlow/releases/tag/v1.6.0](https://github.com/infiniV/VoiceFlow/releases/tag/v1.6.0)

Web: [https://get-voice-flow.vercel.app/](https://get-voice-flow.vercel.app/)

Mostly want to hear answers. Star if it works for you, but a bug report in Issues is more useful.
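For anyone weighing the chunked map-reduce option from the question above, here is a minimal sketch of what it could look like. It is an illustration only, not how VoiceFlow does it. Assumptions: the function names, prompts, and the 8000-character chunk budget are made up for the example; the model tag reuses the `qwen3.5:0.8b-vf` name from the post; and `ollama_generate` assumes a local Ollama server on its default port, called through the `/api/generate` endpoint with a per-request `num_ctx` option. Whether a 0.8B model keeps its structure through the reduce step is exactly the open question, so treat this as a starting point to benchmark.

```python
import json
import urllib.request
from typing import Callable, List

# Hypothetical prompts, written for this sketch only.
MAP_PROMPT = ("List the decisions, action items, and open questions in this "
              "part of a meeting transcript. Use only what is stated.\n\n")
REDUCE_PROMPT = ("Merge these partial notes into one summary with the sections "
                 "TL;DR, Decisions, Action items, Open questions.\n\n")


def split_transcript(text: str, max_chars: int = 8000) -> List[str]:
    """Greedily pack whole lines into chunks of at most max_chars characters."""
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for line in text.splitlines():
        # Hard-split any single line that exceeds the budget on its own.
        pieces = [line[i:i + max_chars] for i in range(0, len(line), max_chars)] or [""]
        for piece in pieces:
            added = len(piece) + 1  # +1 for the newline that joins lines
            if current and size + added > max_chars:
                chunks.append("\n".join(current))
                current, size = [], 0
            current.append(piece)
            size += added
    if current:
        chunks.append("\n".join(current))
    return chunks


def ollama_generate(prompt: str, model: str = "qwen3.5:0.8b-vf",
                    num_ctx: int = 16384,
                    host: str = "http://localhost:11434") -> str:
    """One non-streaming completion from a local Ollama server (assumed running)."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # per-request context window
    }).encode("utf-8")
    req = urllib.request.Request(host + "/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def map_reduce_summary(transcript: str, generate: Callable[[str], str],
                       max_chars: int = 8000) -> str:
    """Map: summarize each chunk. Reduce: merge the partial notes in one call."""
    chunks = split_transcript(transcript, max_chars)
    if not chunks:
        return ""
    partials = [generate(MAP_PROMPT + chunk) for chunk in chunks]
    notes = "\n\n".join(f"Part {i}:\n{p}" for i, p in enumerate(partials, 1))
    return generate(REDUCE_PROMPT + notes)
```

Usage would be something like `map_reduce_summary(open("meeting.txt").read(), ollama_generate)`, where `meeting.txt` is a hypothetical transcript file. Passing the generator in as a callable keeps the chunking logic testable without a GPU and makes it easy to swap in a Groq or llama.cpp client. The main trade-off is that cross-chunk references (a decision in hour one revisited in hour two) only survive if the partial notes carry them forward.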


Similar Articles

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

High VRAM local coding model — still Qwen 3.6 27B?

Got local Qwen 3.5/3.6 generating meeting summaries entirely offline on an M4 Max. Demo with Wi-Fi off. This is the future.

Running Qwen3.6-35B-A3B on a laptop RTX 4060 (8GB) — what worked, what didn't, and a surprising speculative-decoding result

Qwen3.6 35B MoE on 8GB VRAM — working llama-server config + a max_tokens / thinking trap I ran into
