★ Follow-up to "Blaming the model won't fix your workflow": the paper is now a preprint. The real learnings: composable domains, a verification ratchet, and tool naming.
The author announces a preprint on building an agentic system with composable domains, a verification ratchet, and careful tool naming, sharing practical lessons and an open-source Common Lisp implementation.
A month ago I posted the very rough beginnings of a paper. That rough version did not survive: it got pulled apart and rebuilt by the very process it describes, and what came out the other side is now a proper preprint with a DOI: https://doi.org/10.5281/zenodo.21139628

Short version: the core claim held. The artifacts (specs, plans, executable graphs) and the verification gates wrapped around them have proven out on real work. Agents produce the work, the gates catch the defects, and a milestone only closes when the evidence is real, not when the model announces it is done.

Honestly, though, the headline result was not the most valuable thing I got out of building it. What I actually want to pass on is three things I learned making it work.

The first was composable domains. A "domain" in my setup is a bundle of instructions, skills, and tool access you hand an agent for a class of task. I built the first few as one-offs. Once I redesigned them to compose (stack cleanly, assume nothing about each other) they started turning up useful in places I had not planned for. A domain written for one workflow dropped straight into two others unchanged, and the same pattern is now carrying an entirely separate application build. Designing for composition instead of single use is the thing I would do first next time.

The second was the ratchet, and it needs a concrete example. An agent once delivered a load test asserting the record count was greater than or equal to zero. Green forever, catches nothing, and it looks completely normal in review. So the loop now runs like this: acceptance criteria are written before the code exists, the coding agent never writes tests at all, a fresh session verifies the code against those criteria, only then does another session derive regression tests from them, and a final step breaks the code on purpose to confirm each test can actually fail. A test that survives that is frozen, and later work runs against it and cannot silently undo it.
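To make the last step of that loop concrete, here is a minimal Python sketch of the "break the code on purpose" check. It is not taken from the reference implementation (which is Common Lisp); every name in it (`load_records`, `broken_load_records`, `can_be_frozen`, and the acceptance criterion itself) is a hypothetical stand-in chosen for illustration.

```python
from typing import Callable, Iterable

Loader = Callable[[Iterable[str]], int]
Test = Callable[[Loader], None]


def load_records(rows: Iterable[str]) -> int:
    """Hypothetical code under test: count the non-empty rows that were loaded."""
    return sum(1 for row in rows if row.strip())


def broken_load_records(rows: Iterable[str]) -> int:
    """Deliberately broken variant: loads nothing at all."""
    return 0


def vacuous_test(loader: Loader) -> None:
    # The kind of assertion from the story: true for any loader.
    assert loader(["a", "b", ""]) >= 0


def real_test(loader: Loader) -> None:
    # Derived from an assumed criterion: two non-empty rows in, two records out.
    assert loader(["a", "b", ""]) == 2


def passes(test: Test, loader: Loader) -> bool:
    """Run one test against one implementation and report pass/fail."""
    try:
        test(loader)
    except AssertionError:
        return False
    return True


def can_be_frozen(test: Test, good: Loader, broken: Loader) -> bool:
    """Admit a test only if it passes on good code and fails on broken code."""
    return passes(test, good) and not passes(test, broken)
```

Under these assumptions, `can_be_frozen(vacuous_test, load_records, broken_load_records)` is `False`, because the `>= 0` assertion stays green even when nothing is loaded, while `real_test` is admitted. A real gate would presumably break the code in more than one way; a single broken variant is the smallest version of the idea.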
Standards move one way only. That killed a whole class of "looks done, isn't."

The third was dumber and more surprising: tool naming matters far more than it should. An agent routes off a tool's name, and the name drags the model's training priors with it. What fixed things was never cleverness: borrow names from tools the model already knows, mirror the built-in parameter vocabulary exactly (renaming one parameter from `code` to `content` ended a whole class of thrashing), and never let a familiar name lie about what the tool does. The kicker: a strong model absorbs a bad interface and hides the problem from you, so test your tool surface with the weakest model that can still do the work.

Everything above runs as an open reference implementation: the orchestrator, the verification cycle, the composable domains. To set expectations, this is not another 180-line agent loop. It is the third generation of a design that got ground out until it was useful rather than until it was postable, and it has only recently earned daily-driver status. It also passes the dogfood test, since the system's own development runs through its own gates, and the deepest bugs it ever caught were in itself. Fair warning before you click: it is Common Lisp.

https://gitlab.com/naive-x/experimental/cl-naive-full-stack-agentic-system

Preprint is here if you want the formal version: https://doi.org/10.5281/zenodo.21139628

Happy to take questions. And one worth asking of any agent-written suite: when did a test last fail because it caught wrong code? I could not answer that for mine, and that is where all of this started.
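The parameter-vocabulary point in the tool-naming lesson can be checked mechanically. The sketch below is a hypothetical lint in Python, not part of the reference implementation: the `BUILTIN_PARAMS` set, the `write_file` tool, and the dictionary layout are all assumptions made for illustration, and the real vocabulary would come from whatever built-in tools your model already knows.

```python
# Assumed vocabulary: parameter names the model's built-in tools already use.
BUILTIN_PARAMS = {"path", "content", "command", "pattern"}


def unfamiliar_params(tool: dict) -> list[str]:
    """Return the tool's parameter names that fall outside the known vocabulary."""
    return sorted(name for name in tool["parameters"] if name not in BUILTIN_PARAMS)


# Hypothetical tool definition before the rename described in the post.
before = {"name": "write_file", "parameters": {"path": "string", "code": "string"}}

# The same tool after renaming `code` to `content`.
after = {"name": "write_file", "parameters": {"path": "string", "content": "string"}}
```

With these definitions, `unfamiliar_params(before)` flags `code` and `unfamiliar_params(after)` returns an empty list. A lint like this only catches vocabulary drift; it cannot tell whether a familiar name lies about what the tool does, which still needs the weakest-model test described in the post.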
A white paper that identifies 24 failure modes in AI agent workflows and proposes a structural enforcement architecture with three-layer enforcement, task graphs, and verification, along with a reference implementation in Common Lisp.
This paper proposes principled approaches for designing and optimizing practical agentic LLM systems, introducing a framework with pseudo-tools and fixed workflows to improve modularity, cost-efficiency, and accuracy across diverse tasks.
New paper proposes self-compacting language model agents that can decide when to clean up their own traces of reasoning and tool calls to avoid accumulating mistakes and stale information.
This paper introduces Regimes, an auditable, held-out-gated improvement loop built on the ActiveGraph runtime for self-improving agents. It demonstrates modest improvements on the LongMemEval dataset by autonomously discovering prompt repairs that pass static checks, sandbox execution, and held-out validation.
A practical guide arguing that mastering sub-agents requires building four specific workflows in a weekend, covering decomposition, context packaging, verification, and cost control, rather than spending 200 hours on tutorials.