Describes a two-layer small LLM architecture: a local always-on agent (Raven) on an RTX5080 and an online reasoning stack (Trinity Cortex) with three small models and a knowledge graph, arguing that small models are better than large frontier models for graph-based reasoning.
I've always believed that the future of AI is in small architectures, so I built a thing. This isn't a "we need bigger models" post. It's a "we're using small models the way they're meant to be used" post.

The architecture has two layers, one clean split. One layer is the Raven agent, local and always on (RTX5080, 16GB VRAM), and the other layer is the Trinity Cortex stack, online and on demand (cheap, fast hosted 7B/13B/MoE models), built around a dense, purpose-engineered knowledge graph. The Cortex relies on 3 shards to run inference per query cycle.

Layer 1: Raven Agent (Local, Always-On; RTX5080 16GB VRAM)

- Hardware: RTX5080 (16GB VRAM, 64GB system RAM)
- Model: Qwen2.5 14B Q4_K_M (~9GB VRAM) or 32B Q3_K_M (~13GB)
- Role: Interface agent (memory management, file ops, task queues, human conversation)
- Latency: Sub-second. No API calls. No network dependency.

Raven is the always-on local brain. It handles most local I/O, memory management, and user conversation tasks. The full 16GB VRAM is dedicated to one model: no sharing, no swapping, no contention. Raven doesn't do deep cognition. It's the interface layer.

Layer 2: Trinity Cortex (Online, On-Demand; Inception/Diffusion API)

- ENG: 7B Q4 (~$0.04-0.05/hr) → analytical, structure
- SYNTH: 13B Q4 (~$0.08-0.09/hr) → synthesis, integration
- PRIME: Small MoE (~$0) → arbitration, current events grounding

Three small models, each with a specific cognitive role. They only fire when Raven needs deep cognition, roughly 20% of user-initiated turns in our usage pattern.

The key insight: small models are better for this than frontier models.

Why Small Models Work Better Here

Trinity uses a knowledge graph (LTKG) as its primary reasoning substrate. Concepts are nodes. Relationships are edges. Queries are traversals, not prompts.

Large frontier models (200B+) are bad at this. They have so much parametric knowledge that they answer from their weights, not from your graph.
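To make "queries are traversals, not prompts" concrete, here is a minimal sketch of a node/edge graph in SQLite with a bounded traversal. The post only says the LTKG is a SQLite graph of concept nodes and relationship edges, so the table names, column names, sample rows, and the `neighborhood` helper are all assumptions for illustration, not the actual LTKG schema.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions, not the real LTKG.
SCHEMA = """
CREATE TABLE nodes (id TEXT PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE edges (
    src TEXT NOT NULL REFERENCES nodes(id),
    dst TEXT NOT NULL REFERENCES nodes(id),
    relation TEXT NOT NULL
);
"""

# Recursive CTE: follow outgoing edges from a start node for at most N hops.
# UNION (not UNION ALL) drops duplicate (id, hops) rows, and the hop limit
# guarantees termination even when the graph contains cycles.
TRAVERSE = """
WITH RECURSIVE walk(id, hops) AS (
    SELECT ?, 0
    UNION
    SELECT e.dst, w.hops + 1
    FROM walk AS w JOIN edges AS e ON e.src = w.id
    WHERE w.hops < ?
)
SELECT n.id, n.label, MIN(w.hops) AS min_hops
FROM walk AS w JOIN nodes AS n ON n.id = w.id
GROUP BY n.id
ORDER BY min_hops, n.id;
"""


def build_demo_graph() -> sqlite3.Connection:
    """Create a tiny in-memory graph with made-up sample concepts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO nodes VALUES (?, ?)", [
        ("n1", "small models"),
        ("n2", "knowledge graph"),
        ("n3", "graph traversal"),
        ("n4", "parametric recall"),
    ])
    conn.executemany("INSERT INTO edges VALUES (?, ?, ?)", [
        ("n1", "n2", "defers_to"),
        ("n2", "n3", "queried_by"),
        ("n3", "n1", "feeds"),        # deliberate cycle back to n1
        ("n4", "n1", "limited_in"),   # n4 points at n1 but is not reachable from it
    ])
    return conn


def neighborhood(conn: sqlite3.Connection, start: str, max_hops: int):
    """Return (id, label, min_hops) for every node within max_hops of start."""
    return conn.execute(TRAVERSE, (start, max_hops)).fetchall()


if __name__ == "__main__":
    graph = build_demo_graph()
    for row in neighborhood(graph, "n1", 2):
        print(row)
```

The rows returned by a traversal like this, rather than free-form context, are what a shard would be handed to reason over.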
With a frontier model, the LTKG becomes decoration: overhead the model ignores because it already "knows" the answer.

Small models (7B-13B) are better because:

- They defer to structure. With less parametric capacity, they actually use the graph topology you give them. The LTKG becomes scaffolding, not a suggestion box.
- Graph topology becomes the primary reasoning substrate. Every concept node encodes compressed projections of related nodes. Small models, being less able to rely on parametric recall, actually use this graph structure rather than overriding it with their own training knowledge.
- They stay in role. A 7B model with a "you are the analytical shard" instruction actually stays analytical. A frontier model tends to flatten into general-purpose competence regardless of the role assignment.
- They're cheap. ~$0.09-0.14/hr combined runtime. Serverless cold-start in 2-5 seconds. No GPU contention: Trinity runs online, Raven runs local, never the same silicon.

The PRIME Problem and the MoE Solution

PRIME's job is arbitration: when ENG and SYNTH disagree (measured as divergence), PRIME adjudicates. But PRIME also needs to handle current events, which is exactly what small models with training cutoffs can't do.

The solution is a small Mixture of Experts (~4B active parameters) where:

- One expert handles arbitration logic (pure reasoning, doesn't need recent data)
- One expert has access to a lightweight grounding source: a retrieval module that scrapes recent news feeds and passes relevant snippets alongside the query
- The router decides which expert fires based on whether the query requires current context

This gives PRIME current-events awareness without needing a large context window or a frontier model. A small MoE means small cost plus a current-aware PRIME, so the system handles "what happened today" questions without hallucinating cutoff dates.

Wait, How Does This Make Sense on One GPU?

It doesn't. That's the point. The RTX5080 is not shared between Trinity's shards. It's dedicated entirely to Raven. The three shards (ENG, SYNTH, PRIME) run online on Inception/diffusion LLMs: serverless, cheap, no VRAM requirement.

The 5080 is Raven's brain. Period. The 16GB doesn't need to fit three models because it only runs one.

This took me way too long to figure out. I kept trying to optimize VRAM allocation, find the right quant tradeoffs, fit everything on one card. The answer was: don't. Split the architecture instead.

The Protocol Layer

Raven and Trinity communicate via a compact JSON protocol (TRIP/RVT v1.1). Raven never re-transmits context Trinity already knows; everything references the shared knowledge graph by node ID. Token budgets are hard-capped per exchange, preventing runaway cost or latency. Responses are minimal: just the answer, not prose wrapping.

What This Actually Costs

- Raven: $0 (local hardware; electricity ~$0.10/day)
- ENG: ~$0.04-0.05/hr, used ~2 hrs/day = ~$2.50-3.00/month
- SYNTH: ~$0.08-0.09/hr, used ~2 hrs/day = ~$5.00-5.50/month
- PRIME: ~$0 (MoE via Gemini free tier + lightweight web grounding)
- Total inference cost: ~$7-9/month (varies with usage pattern)

That's less than a streaming subscription, for a three-shard cognitive architecture with an always-on local agent and a knowledge graph with ~10,000 nodes.

The Question

I'm curious if anyone else is running architectures like this: small models in structured roles, local agent + online shards, graph-deferred reasoning (building a structured graph first, then querying models against it) instead of parametric recall. The frontier model paradigm (one huge model, one prompt, all the context) works, but it's expensive and architecturally flat. This approach trades raw capacity for structure, role separation, and graph-aware reasoning.
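As a quick sanity check on the cost breakdown above, the arithmetic can be sketched in a few lines. The hourly rates and the ~2 hrs/day usage are the post's figures; the 30-day month and the function names are assumptions. Electricity for the local card is left out because the post's total covers inference only.

```python
# Hourly rate ranges (low, high) in USD, taken from the post.
RATES = {
    "ENG": (0.04, 0.05),
    "SYNTH": (0.08, 0.09),
    "PRIME": (0.0, 0.0),  # the post lists PRIME at roughly $0
}

HOURS_PER_DAY = 2     # usage figure from the post
DAYS_PER_MONTH = 30   # assumption: a 30-day month


def monthly_range(rate_low, rate_high, hours=HOURS_PER_DAY, days=DAYS_PER_MONTH):
    """Return the (low, high) monthly cost for one shard, rounded to cents."""
    return (round(rate_low * hours * days, 2), round(rate_high * hours * days, 2))


def total_monthly(rates):
    """Sum the per-shard monthly ranges into one (low, high) total."""
    lows, highs = zip(*(monthly_range(lo, hi) for lo, hi in rates.values()))
    return (round(sum(lows), 2), round(sum(highs), 2))


if __name__ == "__main__":
    for shard, (lo, hi) in RATES.items():
        print(shard, monthly_range(lo, hi))
    print("total", total_monthly(RATES))  # (7.2, 8.4)
```

Under these assumptions the total comes out at $7.20-8.40/month, inside the ~$7-9/month range quoted above.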
The takeaway isn't that frontier models are bad. It's that structured cognition with small models is a viable alternative when you design for it. The architecture does work that model size would otherwise need to cover.

And this isn't theoretical: it's running daily right now. Discord-based agent interface, live bridge to Trinity, ~10K-node knowledge graph, the whole stack.

Would love to hear from anyone experimenting in similar directions.

---

Specs: RTX5080 16GB, 64GB RAM, Qwen2.5 14B Q4 local, Trinity Cortex on Inception API, LTKG SQLite graph ~10K nodes, Discord-based agent interface.

Happy to chat further if interested.

https://preview.redd.it/x78j1vct68ah1.png?width=790&format=png&auto=webp&s=930e8f33ddc5e5af74ad6c3cf63e82e65cec40bc
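For anyone wondering what the protocol layer might look like on the wire, here is a rough sketch of a request that references graph nodes by ID and enforces a hard token cap. The post does not publish the TRIP/RVT v1.1 message format, so every field name, the cap of 256, and the whitespace-based token count are assumptions for illustration only.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

# Assumed hard cap per exchange; the post does not state the real number.
MAX_TOKENS_PER_EXCHANGE = 256
VALID_SHARDS = {"ENG", "SYNTH", "PRIME"}


@dataclass
class ShardRequest:
    """Hypothetical request shape; not the actual TRIP/RVT schema."""
    shard: str                 # which shard should answer
    query: str                 # the question itself, kept short
    node_ids: List[str]        # references into the shared graph, not re-sent context
    max_tokens: int = MAX_TOKENS_PER_EXCHANGE
    version: str = "1.1"


def encode(request: ShardRequest) -> str:
    """Serialize a request compactly, clamping the token budget to the hard cap."""
    if request.shard not in VALID_SHARDS:
        raise ValueError(f"unknown shard: {request.shard}")
    payload = asdict(request)
    payload["max_tokens"] = min(request.max_tokens, MAX_TOKENS_PER_EXCHANGE)
    return json.dumps(payload, separators=(",", ":"))


def truncate_to_budget(text: str, max_tokens: int) -> str:
    """Cut a response to the budget, using whitespace words as a stand-in
    for real tokenizer tokens (an assumption made to stay dependency-free)."""
    return " ".join(text.split()[:max_tokens])


if __name__ == "__main__":
    req = ShardRequest("ENG", "compare these concepts", ["n1", "n2"], max_tokens=10_000)
    print(encode(req))
```

The point of the shape is that the payload stays small no matter how much context sits behind the node IDs, and a caller cannot request more than the cap.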
Atlarix is a desktop environment that pre-parses codebases into a node/edge graph, allowing coding agents to navigate architecture via queries instead of reading raw text, which improves performance of smaller local models.
This blog post explains Large Reasoning Models (LRMs), how they differ from standard LLMs, their training, and when to use them. It covers examples like DeepSeek R1 and GPT-5.5 Thinking.
A controlled study of compound LLM agent design in an adversarial POMDP (CybORG CAGE-2), systematically varying context, reasoning, and hierarchy across five model families. Key findings: programmatic state abstraction yields large returns per token, hierarchy without deliberation tools achieves best absolute performance, and context engineering is more cost-effective than deeper reasoning.
This paper introduces LLM-as-Environment-Engineer, a framework where LLMs design their own training environments for reinforcement learning in multi-agent reasoning tasks, enabling self-improving training that surpasses larger proprietary models.