An agent that plans with a frontier model but runs most of its tokens locally (built it for my own dual-3090 rig)

Reddit r/LocalLLaMA Tools

Summary

The author built a personal AI agent that uses a frontier model (Codex) for high-level planning while running most token processing locally on a dual RTX 3090 system, enabling long-duration tasks with deterministic validation. The agent supports three swappable tiers: planner, local, and senior, and is available as an open-source repository.

For the past couple of months, I've been building a tool for my personal use. I have a dual RTX 3090 system that I wanted to put to use, but Qwen 3.5/3.6 27B and Gemma 4 31B, while really good, just didn't have the taste or the ability of a frontier model. OTOH, frontier models are expensive and I didn't want everything I do running through them. I wanted the best of both worlds: frontier reasoning for the plan, local models doing almost all the actual work.

I have tried a few repos that let small models punch above their weight by 'calling' frontier models, but that's not what I wanted. I want to plan with the frontier model, because my experience in software engineering over the last decade+ has taught me that design is the bottleneck in most projects, and getting the design right is what prevents spaghetti code and rewrites.

It took a lot of iterations, but I now have an agent that does this and I'm using it for my own work. The crux of it is below. It leans on a lot of existing tools (no reinventing the wheel), and it's all customizable.

3 tiers, all swappable via a config file:

* Planner: Codex (extremely powerful, though anything that emits the decision JSON works here)
* Local: Qwen 3.6 27B (great for agentic use and tool calling, good enough for coding)
* Senior (optional): Kimi K2.6 via opencode-go (used when the local model fails and its retry attempts are exhausted)

You can have all 3 tiers local, 2 tiers local, one frontier and one local, or any other combination. This is just what I found to work best.

Every task goes to Codex, which can map it to N phases. A big coding task will usually map to 3 phases (research, implement, review); a review task is also split into phases (review, artifact). Each phase can grind for multiple epochs, and each epoch hands out tasks that the local models do (and do very well). All of this is planned by Codex.

The biggest differentiator is deterministic validation.
A task only counts as done when a check actually passes, i.e. a command exits 0 or the file it was supposed to produce exists. The state machine re-runs those checks itself instead of trusting what the model says it did, so a multi-hour chain can't drift by claiming progress it never made.

I've found that this makes local models much more capable than they would otherwise be:

1. They can do tasks that span hours and hours.
2. You get the taste and capability of a frontier model, but roughly 85-90% (based on my measurement) of tokens go through local models. For output tokens it's about 95%.
3. Context isolation prevents context rot, and the frontier model is much cheaper because its context window doesn't overflow with bash calls.
4. It also does some useful stuff by default: it uses a repomapper to map the repo as a graph, and curates context fairly aggressively so the local models aren't drowning in irrelevant files.

It's still WIP, but it's finally at a stage where it's usable, so I was wondering if y'all would like to try it (repo in first comment).

Things that are messy:

* Installation: not very clean. I use a bunch of existing open source software like pi, opencode, etc.
* No UI: it's just a shell command with a simple TUI showing status updates. You need to create your own job.md file (or have an agent create one).
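
The deterministic validation and the retry-then-escalate flow described above can be sketched roughly like this. This is a minimal illustration, not code from the repo: `Check`, `Worker`, and `run_task` are hypothetical names, the default of 3 retries is an assumption, and the workers are plain callables standing in for the local and senior models.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional, Sequence


@dataclass
class Check:
    """One deterministic check (hypothetical shape): a command that must
    exit 0, or a file that must exist relative to the working directory."""
    command: Optional[Sequence[str]] = None
    path: Optional[Path] = None

    def passes(self, cwd: Path) -> bool:
        if self.command is not None:
            result = subprocess.run(self.command, cwd=cwd, capture_output=True)
            return result.returncode == 0
        if self.path is not None:
            return (cwd / self.path).exists()
        return False  # a check with nothing to verify never passes


# A worker attempts the task inside `cwd` and returns its own claim of
# success. The claim is deliberately ignored by run_task.
Worker = Callable[[Path], bool]


def run_task(
    checks: Sequence[Check],
    local: Worker,
    senior: Optional[Worker],
    cwd: Path,
    max_retries: int = 3,  # assumed default, not from the post
) -> str:
    """Return which tier finished the task: 'local', 'senior', or 'failed'."""

    def verified() -> bool:
        # Re-run every check ourselves instead of trusting the worker.
        return bool(checks) and all(check.passes(cwd) for check in checks)

    for _ in range(max_retries):
        local(cwd)
        if verified():
            return "local"

    # Local retries exhausted: escalate once to the optional senior tier.
    if senior is not None:
        senior(cwd)
        if verified():
            return "senior"
    return "failed"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        checks = [
            Check(path=Path("report.txt")),
            Check(command=[sys.executable, "-c", "import sys; sys.exit(0)"]),
        ]

        def overconfident_local(cwd: Path) -> bool:
            return True  # claims success but produces nothing

        def senior_worker(cwd: Path) -> bool:
            (cwd / "report.txt").write_text("done")
            return True

        print(run_task(checks, overconfident_local, senior_worker, workdir))
        # prints "senior"
```

In the demo, the local worker reports success three times without writing `report.txt`, so the checks never pass and the task escalates. The point of the design is that a worker's return value never influences the outcome; only the re-run checks do.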
