Local LLM autocomplete + agentic coding on a single 16GB GPU + 64GB RAM

Reddit r/LocalLLaMA 05/12/26, 02:53 PM Tools

local-llm autocomplete agentic-coding qwen llama-cpp gguf gpu-setup

Summary

A technical guide on setting up local LLM autocomplete (Qwen2.5-Coder-7B) and agentic coding (Qwen3.6-35B-A3B) on a single 16GB GPU with 64GB+ RAM using llama.cpp, including commands and performance benchmarks.

Today I set up a full coding toolbox on a single RTX 5080 (with RAM offloading) that's actually viable.

**Autocomplete**: bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L

**Agentic**: unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL

---

### Why these models:

Qwen2.5 is still the best model for infill imo. I tried Gemma4 E4B and Qwen3.5 9B/4B and both produce weird suggestions. This autocomplete model takes ~8GB VRAM using the command below. Suggestions are basically instant.

Qwen3.6 35B-A3B is actually good at agentic coding at Q8 if you give it a good prompt. At Q4 it's not usable tbh and gets lost a lot, but at Q8 it can figure stuff out and actually finish its work correctly. If you don't have a lot of RAM for MoE experts, try Q6_K, but lower quants have noticeable quality issues.

You probably need 64GB total RAM minimum. I have 96GB, but with both models running and a bunch of random stuff open (browser, IDE, Teams) I'm at 56GB used.

Because the 35B model has only 3B active params, it's still fast and fits into the remaining 8GB VRAM.

---

### Commands:

```bash
llama-server -hf bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L \
  -ngl 99 --no-mmap --ctx-size 0 -ctk q8_0 -ctv q8_0 \
  -np 1 --temp 0.5 --top-p 0.95 --top-k 20 --min-p 0.0 --port 8081
```

Note: I actually have no idea which sampling parameters to use for Qwen2.5, maybe someone will enlighten me and I'll edit the post.

```bash
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL \
  --no-mmap --no-mmproj -fitt 0 -ngl 99 --cpu-moe \
  -b 2048 -ub 2048 --jinja \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.01
```

llama.cpp autofits the model and I get ~145k context with this command. You can use `-ctv q8_0 -ctk q8_0` if you want more context.

35B-A3B speed with this setup:

```
pp4096 | 2093.93 ± 22.64
tg128  | 35.29 ± 0.48
```
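To put the pp4096/tg128 numbers above in practical terms, here is a rough back-of-the-envelope sketch of how long one agentic turn might take. It assumes the two figures are tokens per second (the way llama-bench reports them) and that throughput stays flat as the context grows, which it usually doesn't, so treat the result as a lower bound. The function name and the 20000/500 token workload are made up for illustration.

```shell
# Rough latency estimate from the benchmark figures quoted in the post.
# Assumption: both figures are tokens/s and stay constant at any context size.
PP_TPS=2093.93   # prompt processing speed (pp4096)
TG_TPS=35.29     # token generation speed (tg128)

estimate_turn_seconds() {
  # $1 = prompt tokens, $2 = generated tokens (hypothetical workload sizes)
  awk -v p="$1" -v g="$2" -v pp="$PP_TPS" -v tg="$TG_TPS" \
    'BEGIN { printf "%.1f\n", p / pp + g / tg }'
}

# Example: a 20k-token prompt followed by a 500-token reply.
estimate_turn_seconds 20000 500
```

For that example the prompt accounts for under 10 seconds and generation for about 14, so on this setup generation speed, not prompt processing, tends to dominate a turn unless the prompt is very long.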

Similar Articles

Help optimizing llama.cpp + Qwen 27B on RTX PRO 6000 Blackwell for coding agents

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

I benchmarked 21 local LLMs on a MacBook Air M5 for code quality AND speed

Best config for Qwen3.6 27b / llama.cpp / opencode

@TraffAlex: Best Local LLMs for Consumer GPUs — llama.cpp Guide (June 2026) What I actually run on consumer hardware right now. Eve…
