@leopardracer: https://x.com/leopardracer/status/2055341758523883631


Summary

A user shares their experience setting up a dual-GPU local AI lab with RTX 4080 Super and 5060 Ti, running Qwen 3.6 models via llama.cpp and llama-swap to reduce API costs and enable unrestricted experimentation.

Original Article: https://t.co/EnAVMcfxG4

This 2-GPU Setup Killed My OpenAI Bill

Hello everyone, leopardracer here!

I have two GPUs in my desktop because I like having a local AI lab under my desk.

That is the honest reason.

Not because I had a $500 API bill. Not because I am trying to replace every hosted model. Not because I think everyone should build a mini data center at home.

I wanted to run large models locally, connect them to my tools, and experiment without thinking about usage meters. If an agent loops 400 times, I kill the process. If a prompt is terrible, I rewrite it. If a model hallucinates, I swap it out.

No rate limits. No API keys. No prompts leaving my machine.

That changes how you use AI. You stop treating every request like a metered transaction and start treating the system like a workshop.

This is my current setup.

My Actual Hardware

This is a desktop, not a server rack.

GPU 0: NVIDIA GeForce RTX 4080 SUPER, 16GB GDDR6X

GPU 1: NVIDIA GeForce RTX 5060 Ti, 16GB GDDR6

CPU: AMD Ryzen 9 9900X, 12 cores / 24 threads

RAM: 64GB DDR5-5600

OS: CachyOS, Arch-based rolling Linux

Desktop: Hyprland on Wayland

The 4080 SUPER is the main card. The 5060 Ti is the experiment card. Together they give me 32GB of VRAM, which is enough to run 30B-class models with useful context if you are willing to use quantization.
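
Back-of-envelope, assuming a Q4_K-class quant at roughly 4.6 bits per weight (the exact figure depends on the quantization you pick):

```bash
# Rough VRAM math for a 27B dense model at ~4.6 bits/weight
echo "27 * 4.6 / 8" | bc -l   # ≈ 15.5 GB of weights, leaving ~16 GB across both cards for KV cache and overhead
```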

The point is not that everyone needs two GPUs. A single 24GB card is simpler. A used 3090 is probably the most practical local LLM card for a lot of people. But two 16GB cards create a different kind of playground.

I can split one larger model across both cards. I can run one model for coding and another for comparison. I can dedicate one GPU to text and keep the other available for vision, embeddings, or whatever weird experiment I want to try next.

That flexibility is the fun part.

Why Qwen 3.6

Qwen 3.6 is the first Qwen release where local coding agents started to feel genuinely interesting to me.

My daily driver is Qwen3.6-27B, the dense model. I also keep Qwen3.6-35B-A3B around as an experiment, but the 27B dense model is the one I actually reach for first.

That distinction matters.

From the Qwen technical report numbers I track:

Qwen3.6-27B:

• SWE-bench Verified: 77.2%

• SWE-bench Pro: 53.5%

• Terminal-Bench 2.0: 59.3%

• SkillsBench average: 48.2%

Qwen3.6-35B-A3B:

• SWE-bench Verified: 73.4%

• SWE-bench Pro: 49.5%

• Terminal-Bench 2.0: 51.5%

• QwenClawBench: 52.6%

Claude Opus 4.7 is still the best generally available model in my benchmark notes: 87.6% on SWE-bench Verified, 64.3% on SWE-bench Pro, and 69.4% on Terminal-Bench 2.0. I am not pretending local Qwen beats frontier APIs.

The point is different: local Qwen is good enough to be worth experimenting with, and it runs on hardware I own.

The other win is not maximum context size. Hosted models already push 200K+ tokens, and some go much higher.

The win is that I can abuse long context locally without thinking about the meter. I can feed in messy transcripts, codebases, research notes, or scraped articles, watch where the model breaks, change the prompt, and try again.

APIs can do long context too. But when you are experimenting with agents, long context gets expensive fast. Locally, the constraint is VRAM and patience.

The Stack

Three pieces make this usable.

llama.cpp runs the model. It handles CUDA, GGUF files, quantization, KV cache settings, Flash Attention, and token generation.

llama-swap sits in front of llama.cpp. It exposes OpenAI-compatible endpoints and hot-swaps models on demand. I can route one request to Qwen, another to Gemma, and let unused models unload after a short TTL.

Local agents and coding tools call those endpoints. Some are personal scripts. Some are agent configs. The important part is that they talk to localhost instead of a hosted API.

That means my local model can become infrastructure. Not just a chat window.

My llama-swap Config

Here is the actual llama-swap config pattern I use for larger local models. The same local endpoint setup is what lets my agents call the 27B dense daily driver without API keys.
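
The sketch below shows the pattern rather than my literal file: model paths and names are placeholders, and a field spelling or two may differ from the llama-swap README, so double-check before copying.

```yaml
# llama-swap config sketch: paths, model names, and the Gemma entry are placeholders
models:
  "qwen3.6-27b":
    cmd: |
      llama-server
      --port ${PORT}
      -m /models/Qwen3.6-27B-UD-Q4_K_XL.gguf
      -ngl 99
      -c 131072
      --cache-type-k q8_0
      --cache-type-v q4_0
      --flash-attn on
      --chat-template-kwargs '{"enable_thinking":true}'
    env:
      - "CUDA_VISIBLE_DEVICES=0,1"   # split the 27B across both 16GB cards
    ttl: 300                         # unload after five idle minutes

  "gemma-4-26b":
    cmd: |
      llama-server
      --port ${PORT}
      -m /models/Gemma-4-26B-Q4_K_M.gguf
      -ngl 99
      -c 32768
      --flash-attn on
    env:
      - "CUDA_VISIBLE_DEVICES=0"     # comparison model stays on the 4080 SUPER
    ttl: 120
```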

The important settings are easy to miss.

CUDA_VISIBLE_DEVICES=0,1 tells llama.cpp to use both GPUs. With two 16GB cards, this is the difference between “interesting” and “why does everything OOM?”

-ngl 99 offloads as many model layers to the GPUs as will fit. For this setup, I want the GPUs doing as much work as possible.

--cache-type-k q8_0 --cache-type-v q4_0 quantizes the KV cache. This matters more as context grows. Long context is not free. The model weights use VRAM, but the context cache also grows as you feed more tokens. Quantizing that cache is how I make 131K context practical.

--flash-attn on enables Flash Attention. On modern NVIDIA hardware, this is one of those flags that feels boring until you turn it off and realize how much it was helping.

--chat-template-kwargs '{"enable_thinking":true}' enables Qwen’s thinking mode. That is useful for local agents because I want to inspect why the model made a decision, not just the final answer.

Local Agents Are the Fun Part

Running a local chat model is nice. Running local agents is better.

I have a Scout agent that monitors Hacker News, Reddit, Twitter, GitHub Trending, Hugging Face, and RSS feeds. It scores items, extracts the interesting ones, and drops drafts into my Content/Insights/drafts folder.

The LLM config is intentionally boring:
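
Roughly this shape (the exact field names depend on your agent framework; the point is the localhost base URL and the fake key):

```yaml
# Sketch of the agent's LLM section: everything routes to the local llama-swap endpoint
llm:
  base_url: "http://localhost:8080/v1"   # llama-swap's OpenAI-compatible endpoint
  model: "qwen3.6-27b"                   # matches the model name in the llama-swap config
  api_key: "not-needed"                  # the client library insists on a key; llama-swap does not
  temperature: 0.2
  max_tokens: 2048
```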

I also have a GPU tracker that watches RTX 5090 pricing and emails me when cards hit my thresholds.

That api_key: "not-needed" line is the whole vibe.

When an agent runs locally, I can be reckless in useful ways. I can over-sample. I can retry. I can log everything. I can test terrible prompts against real workflows. If Scout decides every article is important, I fix the rubric. If the GPU tracker writes a bad summary, I change the prompt.

The failure mode is learning, not a bigger invoice.

What 32GB VRAM Actually Buys You

32GB of VRAM does not make local AI effortless. It makes it flexible.

With this setup I can run Qwen3.6-27B dense as my daily driver, keep Qwen3.6-35B-A3B around for experiments, and run Gemma 4 26B for comparison. I can test context sizes that would feel annoying through an API. I can run agents repeatedly without checking a dashboard.

The best part is trying things that would be too dumb to pay for.

What happens if I feed it 100K tokens of messy notes and ask for contradictions?

What happens if Scout scores 30 Hacker News posts with a stricter rubric?

What happens if I ask the model to inspect a repo and generate five competing refactor plans?

Some experiments work. Some are garbage. That is fine. Local inference makes garbage cheap.

Privacy Is the Serious Benefit

The fun part is experimentation. The serious part is data ownership.

Every prompt stays on my machine. Every response is generated locally. My code, notes, family questions, health logs, and half-formed ideas do not have to leave my desktop just because I want an AI assistant to look at them.

This is not a conspiracy theory about hosted AI companies. It is a simpler claim: if the model runs locally, I do not have to trust a vendor with that prompt.

That matters for personal operating systems. It matters for private code. It matters for health questions. It matters for any note you would hesitate to paste into a web app.

The privacy benefit is not dramatic. It is quiet. You stop thinking, “Should I send this?” because the answer is no. It never leaves.

Setup Path

If you want to try this, start smaller than you think.

You do not need my exact hardware. You need a CUDA-capable NVIDIA GPU, enough VRAM for the model you want, and enough patience to debug CUDA once or twice.

For 7B to 14B models, 12GB to 16GB VRAM is plenty.

For 27B to 35B models, 24GB is comfortable. Two 16GB cards can work if you split the model across both.

For long context, get 64GB system RAM. VRAM gets the attention, but system RAM matters once you start juggling models, agents, browsers, and a desktop environment.

Build llama.cpp with CUDA:
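
With the CUDA toolkit and CMake installed, the standard CUDA build looks like this:

```bash
# Clone and build llama.cpp with CUDA support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
```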

Download a GGUF quantization for the model you want. I use Unsloth’s UD quantizations because they have been practical for these larger models.
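
For example, with the Hugging Face CLI (the repo and file names below are placeholders, not the exact Unsloth repo names):

```bash
# Download a single GGUF file into a local models directory
huggingface-cli download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-UD-Q4_K_XL.gguf --local-dir ~/models
```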

Install llama-swap:
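
llama-swap ships as a single Go binary, so installing it is mostly downloading a release (asset names and flags vary by version, so check its README):

```bash
# Grab a release from https://github.com/mostlygeek/llama-swap/releases, then unpack and run it
# against the config from earlier. Check --help if the flags differ on your version.
tar -xzf llama-swap_*_linux_amd64.tar.gz
./llama-swap --config config.yaml --listen localhost:8080
```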

Then test the endpoint:
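
Assuming llama-swap is listening on port 8080 and the model name matches the config above:

```bash
# Ask for a completion; llama-swap starts llama-server for this model on demand
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3.6-27b",
        "messages": [{"role": "user", "content": "Say hello from my desk."}]
      }'
```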

If you get a response, point any OpenAI-compatible client at that base URL and start experimenting.

Troubleshooting Notes

CUDA out of memory: Reduce context first. Long context is usually the culprit. If that is not enough, use more aggressive KV cache quantization.

Only one GPU is active: Check CUDA_VISIBLE_DEVICES=0,1 and watch nvidia-smi while the model loads. Do not wait until generation starts. You want to see memory allocated on both cards during load.
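
For example:

```bash
# Keep an eye on both cards' memory while llama-server loads the model
watch -n 1 nvidia-smi
```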

Slow generation: Confirm the model layers are actually on GPU. If -ngl is too low or CUDA did not build correctly, you may be running more on CPU than you think.

Model will not load: Update llama.cpp and rebuild. New GGUFs and chat templates often need recent llama.cpp changes.

Answers are weird: Check the chat template before blaming the model. Local models are sensitive to prompt formatting. A wrong template can make a good model look broken.

What This Setup Cannot Do

It cannot replace every hosted model.

For hard reasoning, I still use frontier APIs when quality matters more than locality. Claude and GPT-class models are better for some tasks. That is fine.

It cannot do live web search unless you give it tools. Local models do not magically know what happened this morning.

It cannot generate images. Use ComfyUI or another image stack for that.

It is also not silent. Two GPUs under load make noise. Fans go brrr.

Own Your Lab

My desktop now runs models that would have felt absurdly capable a few years ago. That is the interesting story.

Not that I optimized a bill. Not that I found a secret replacement for every API. The interesting part is that consumer hardware plus open models now lets one person build a private AI lab at home.

You can inspect it. You can break it. You can wire it into agents. You can feed it private data without sending that data anywhere. You can learn the system by touching the whole stack.

That is why I run Qwen locally.

Because renting AI is useful.

Owning a little piece of it is more fun.

If you are running local models, I want to hear about your setup. What hardware? What model? What broke first?

Hit the heart below and share this with someone who thinks AI has to be expensive.

Similar Articles

RTX Pro 4500 Blackwell - Qwen 3.6 27B?

Reddit r/LocalLLaMA

A developer shares local inference benchmarks and systemd configurations for running the Qwen3.6-27B model on an NVIDIA RTX Pro 4500 Blackwell GPU using llama.cpp. The post requests optimization tips for throughput and explores potential use cases for larger models.