The author introduces VoiceFlow, an open-source local dictation and meeting transcription tool, and benchmarks small LLMs (qwen3.5:0.8b and Granite 4 350M) for meeting summarization on a 6GB GPU, finding the 0.8B Qwen viable while sub-500M models hallucinate. They also ask the community for long-context summarization solutions on low VRAM.
Disclosure: I made this. Open-source, MIT, Windows + Linux. Not affiliated with [voiceflow.com](http://voiceflow.com) (the chatbot SaaS, name collision, sorry).

Why this exists: I wanted local-only dictation and meeting transcription, because audio shouldn't have to leave the machine just to become text. I had a 6GB GPU sitting there doing nothing most of the day. So I built it: hold a hotkey, faster-whisper transcribes locally, text pastes at the cursor.

v1.6.0 shipped today and adds the meetings recorder: mic + system audio into one stereo file, transcribed locally, summary goes through whatever endpoint you point it at (Ollama, llama.cpp, Groq, OpenAI). The only network call in the whole product is the optional summary, and you pick where it goes.

The on-topic part for this sub: mini models on real workloads. v1.6.0 was the excuse to actually benchmark this on real meeting transcripts instead of toy prompts.

I tried the latest small Qwen first, qwen3.5:0.8b (873M, Q8\_0). Test rig: RTX 3060 Laptop 6GB, \~4.3GB free after Whisper loads, Ollama 0.23, Arch. Input: a real 4-minute meeting, \~2900 chars.

It works, with one caveat. Ollama's VRAM-aware default num\_ctx on this GPU is 4096, and on a reasoning model with thinking-on-by-default that gets eaten before the user-visible tokens land. One-line Modelfile fix:

    FROM qwen3.5:0.8b
    PARAMETER num_ctx 16384

After that it streamed a 1562-char structured summary in 57 seconds at 2.2GB of VRAM. TL;DR, decisions, action items, open questions, all there. Better than I'd expect from sub-1B honestly.

For the "but you didn't go small enough" counter: I sanity-checked Granite 4.0 350M on the same workload. Speed-wise it crushed (0.6 to 2.8 seconds per summary vs 57s for the 0.8B Qwen) and structure came back clean, sections all in the right places.

Then I read the output. On a transcript about Anthropic acquiring Bun, Granite returned "Anthropic's acquisition by Anthropic" and invented Binance as a discussion topic. A different 4-minute meeting came back as a Star Trek bridge log ("Starship Cassiopeia", "Tao City F", colony vessel Andromeda). Keywords matched, relationships scrambled.

So qwen3.5:0.8b-vf is the working floor for me, I haven't seen anything coherent come out of sub-500M on real conversation data yet, open to being shown wrong.

For people who don't want to run local: Groq's free tier on llama-3.3-70b has been solid. \~2 seconds per summary, output is tighter than the local 0.8B, and the only thing that broke it for me was a 4-hour meeting transcript that blew past their context window. For anything under that, it's a real free option.

The actual question I'd like answers on, since this is the sub that knows: long-context structured summarization on low VRAM. The 0.8B Qwen handles a 4-minute meeting comfortably at 16K context. For 1-2 hour transcripts (\~30K-60K tokens) on a 6-8GB GPU, what's working? Pushing context wider and eating the VRAM, chunked map-reduce, or a different small model that doesn't fall apart on long inputs. Looking for something that holds structure (TL;DR + sections + bullets) when the input gets long, without needing 24GB of VRAM to do it.

App: one .exe on Windows, one .AppImage on Linux. Pyloid + React + faster-whisper + SQLite, CUDA auto-detect with CPU fallback. Model + mic + hotkey done in onboarding in about a minute.

Claude was the pair-programming assistant for a lot of boilerplate and the Qt threading gnarliness; architecture and the hard bugs are mine, git history is honest about it.

Repo + 1.6.0:

[https://github.com/infiniV/VoiceFlow](https://github.com/infiniV/VoiceFlow)

[https://github.com/infiniV/VoiceFlow/releases/tag/v1.6.0](https://github.com/infiniV/VoiceFlow/releases/tag/v1.6.0)

Web: [https://get-voice-flow.vercel.app/](https://get-voice-flow.vercel.app/)

Mostly want to hear answers. Star if it works for you, but a bug report in Issues is more useful.
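
For anyone weighing the chunked map-reduce option from the question above, here is a minimal sketch of the idea, not VoiceFlow's actual implementation. It assumes a local Ollama server on its default port and uses the `/api/generate` endpoint with `num_ctx` passed per request. The function names, prompts, `max_chars`, and `overlap` values are all hypothetical placeholders to tune, and the summarizer is injected so any backend can be swapped in.

```python
import json
import urllib.request
from typing import Callable, List

# Assumption: Ollama running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Hypothetical prompts; adjust to whatever structure you want back.
MAP_PROMPT = (
    "Part {i} of {n} of a meeting transcript. List decisions, action items "
    "and open questions as short bullets.\n\n{chunk}"
)
REDUCE_PROMPT = (
    "Merge these partial meeting notes into one summary with the sections "
    "TL;DR, Decisions, Action items, Open questions.\n\n{notes}"
)


def chunk_transcript(text: str, max_chars: int = 12000, overlap: int = 500) -> List[str]:
    """Split text into overlapping chunks, cutting at a line break when one is available."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer a newline, but only far enough in that the next chunk still advances.
            cut = text.rfind("\n", start + overlap + 1, end)
            if cut != -1:
                end = cut
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # repeat a little context across the boundary
    return chunks


def ollama_summarize(prompt: str, model: str = "qwen3.5:0.8b", num_ctx: int = 16384) -> str:
    """Send one non-streaming request to Ollama and return the generated text."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # per-request context size
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def map_reduce_summary(
    text: str,
    summarize: Callable[[str], str],
    max_chars: int = 12000,
    overlap: int = 500,
) -> str:
    """Map: take notes on each chunk. Reduce: merge the notes into one structured summary."""
    chunks = chunk_transcript(text, max_chars, overlap)
    partials = [
        summarize(MAP_PROMPT.format(i=i + 1, n=len(chunks), chunk=chunk))
        for i, chunk in enumerate(chunks)
    ]
    return summarize(REDUCE_PROMPT.format(notes="\n\n".join(partials)))
```

Usage would be something like `map_reduce_summary(transcript, ollama_summarize)`. Each call only ever sees one chunk, so VRAM stays close to the short-meeting case, at the cost of one extra request per chunk. For very long meetings the merged notes can themselves outgrow the context window, in which case the reduce step would need to be applied in rounds. Whether a 0.8B model keeps relationships straight across chunk boundaries is exactly the open question, so treat this as a starting point to test rather than a known-good answer.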
The author shares a high-performance local inference configuration for running Qwen3.6 35B A3B on limited hardware (8GB VRAM, 32GB RAM) using a modified llama.cpp with TurboQuant support, achieving ~37-51 tok/sec with ~190k context.
The user discusses their experience with Qwen 3.6 27B for local coding tasks and asks for recommendations for larger models (100B+) suitable for systems with 224GB of VRAM.
The Hedy meeting app now supports fully offline AI summaries using local models like Qwen and Gemma via llama.cpp, with options for bring-your-own-model and hardware-aware model selection. The update enables Wi-Fi-free operation on Apple Silicon and Windows GPUs, though cloud still offers higher speed and quality.
A detailed account of running the Qwen3.6-35B-A3B MoE model on an 8GB laptop GPU, covering effective optimizations like --no-mmap and VRAM headroom, unexpected findings where speculative decoding improved speed by 26% contrary to benchmarks, and pitfalls with Windows and CPU bottlenecks.
Author shares a working llama-server config to run the 35B-MoE Qwen3.6 model on an 8GB RTX 4060, highlighting a max_tokens trap caused by unconstrained internal reasoning and the fix using per-request thinking_budget_tokens.