The author built a personal AI agent that uses a frontier model (Codex) for high-level planning while running most token processing locally on a dual RTX 3090 system, enabling long-duration tasks with deterministic validation. The agent supports three swappable tiers: planner, local, and senior, and is available as an open-source repository.
For the past couple of months, I've been building a tool for my personal use. I have a dual RTX 3090 system that I wanted to put to use, but Qwen 3.5/3.6 27B and Gemma 4 31B, while really good, just don't have the taste or the ability of a frontier model. OTOH, frontier models are expensive and I didn't want everything I do running through them. I wanted the best of both worlds: frontier reasoning for the plan, local models doing almost all the actual work.

I have tried a few repos that let small models punch above their weight by 'calling' frontier models, but that's not what I wanted. I want to plan with the frontier model, because my experience in software engineering over the last decade+ has taught me that design is the bottleneck in most projects, and good design up front prevents spaghetti code and rewrites.

It took a lot of iterations, but I now have an agent that I use for my own work. Here's the crux of it. It uses a lot of existing tools (no reinventing the wheel), and it's all customizable.

3 tiers, all swappable via a config file:

* Planner: Codex (extremely powerful, though anything that emits the decision JSON works here)
* Local: Qwen 3.6 27B (great for agentic use and tool calling, good enough for coding)
* Senior (optional): Kimi K2.6 via opencode-go (used when the local model fails and its retry attempts are exhausted)

You can have all 3 tiers local, 2 tiers local, one frontier and one local, or any other combination. This is just what I found to work best.

Every task goes to Codex, which maps it to N phases. A big coding task will usually map to 3 phases (research, implement, review). Similarly, a review task also splits into phases (review, artifact). Each phase can grind for multiple epochs, and each epoch hands out tasks that the local models do (and do very well). All of this is planned by Codex.

The biggest differentiator is deterministic validation.
A task only counts as done when a check actually passes, i.e. a command exits 0 or the file it was supposed to produce exists. The state machine re-runs those checks itself instead of trusting what the model says it did, so a multi-hour chain can't drift by claiming progress it never made.

I've found that this makes local models much more capable than they would otherwise be:

1. They can do tasks that span hours and hours.
2. You get the taste and capability of a frontier model, but ~85-90% (by my measurement) of tokens go through local models. For output tokens it's ~95%.
3. Context isolation prevents context rot, and the frontier model is much cheaper because its context window doesn't fill up with bash calls.
4. It also does some useful stuff by default: it uses a repomapper to map the repo as a graph, and it curates context fairly aggressively so the local models aren't drowning in irrelevant files.

It's still WIP, but it's finally at a stage where it's usable, so I was wondering if y'all would like to try it (repo in first comment).

Things that are messy:

* Installation: Not very clean. I use a bunch of existing open source software like pi, opencode, etc.
* No UI: It's just a shell command with a simple TUI showing status updates. You need to create your own job.md file (or have an agent create one).
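To make the deterministic validation idea concrete, here's a minimal sketch of what "re-run the checks instead of trusting the model" could look like. This is not code from the repo: `Check`, `run_task`, the `worker` callable (a stand-in for a local model run), and the `"done"` / `"escalate"` return values are all hypothetical names I'm assuming for illustration.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional, Sequence


@dataclass
class Check:
    """Hypothetical check: a command that must exit 0 and/or a file that must exist."""
    command: Optional[Sequence[str]] = None
    artifact: Optional[Path] = None

    def passes(self) -> bool:
        # A command check passes only on exit code 0.
        if self.command is not None:
            if subprocess.run(self.command, capture_output=True).returncode != 0:
                return False
        # An artifact check passes only if the expected file is really on disk.
        if self.artifact is not None and not self.artifact.exists():
            return False
        return True


def run_task(worker: Callable[[], str], checks: Sequence[Check],
             max_attempts: int = 3) -> str:
    """Run the worker, then re-run every check ourselves.

    The worker's own report is ignored on purpose: only the checks decide.
    Returns "done" if all checks pass, or "escalate" once retries run out
    (assumed here to be the point where a senior tier would take over).
    """
    for _ in range(max_attempts):
        worker()  # whatever the model claims is discarded
        if all(check.passes() for check in checks):
            return "done"
    return "escalate"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        report = Path(tmp) / "report.md"
        checks = [
            Check(command=[sys.executable, "-c", "raise SystemExit(0)"]),
            Check(artifact=report),
        ]

        def honest_worker() -> str:
            report.write_text("findings")
            return "done"

        # Claims success but never writes the file, so the checks fail.
        print(run_task(lambda: "all done!", checks))
        # Actually produces the artifact, so the checks pass.
        print(run_task(honest_worker, checks))
```

The point of the design is that the worker's return value never reaches the decision: a model that says "all done!" without producing the file is treated exactly like one that crashed.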
A developer shares a forked sub-agent repository for pi coding agent that works with a single local LLM slot and limited VRAM, using llama.cpp server and quantized models. The post also discusses performance with the Apex Qwen variant using MTP.
The author built a custom AI agent application wrapping Claude Code and upcoming Codex support, focusing on composable workflows and seeking community feedback.
Crew44 is a local-first orchestrator that turns coding agents like Claude Code and Codex into a coordinated team of specialists, each bound to its best model, with persistent memory and skill accumulation across sessions. It runs entirely on your machine with no cloud dependence and is free under MIT license.
A reminder that two RTX 3090s and open-source models like Qwen 3.6 27B or Gemma 4 31B can run powerful local AI agents, comparable to Opus 4.5, using tools like Claude Code and self-hosted SearXNG.
A detailed guide on setting up custom subagents for Codex, using a grid of six generic agents with varying effort and permission levels, plus a mission card pattern for efficient task delegation.